Welcome to enhancesa’s documentation!
Enhancesa is a collection of tools for simpler statistical analysis in Python. It primarily aids manual analysis and prediction tasks whose workflows use packages like Statsmodels and Scikit-learn.
For example, Enhancesa answers questions like: Which subset of features gives the lowest error rate in an ordinary least squares model? What are the estimates of the population mean and standard deviation using bootstrap resampling?
Guide
Module reference
Function | Description |
---|---|
bootstrap | Estimate the population mu and SE of a sample with the bootstrap resampling method. |
diag_plots | Produce the four R-style OLS diagnostics plots. |
SubsetSelect | Goes through all features and finds the ones that are best predictors of \(y\). |
Bootstrap resampling
enhancesa.bootstrap.bootstrap(X, iters=1)
Estimate the population mu and SE of a sample with the bootstrap resampling method. For a quick intro, go here.
Parameters:
- X (an array/series object) – The sample to resample from.
- iters (int, optional) – The number of resampling iterations. Usually a large value, e.g. 1000.
Returns: Contains the estimated population mean and standard deviation of \(n\) samples from the given x sample.
Return type: DataFrame or Series object
Examples
>>> import numpy as np
>>> import enhancesa
>>> x = np.random.normal(size=100)
>>> enhancesa.bootstrap(x, iters=1000)
Estimated mean: -0.025309
Estimated SE: 0.095531
dtype: float64
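To make the idea behind bootstrap resampling concrete, here is a minimal NumPy-only sketch. It is not Enhancesa's implementation: the helper name `bootstrap_mean_se` and its `seed` parameter are hypothetical, introduced only for illustration. The sketch resamples the data with replacement, computes the mean of each resample, and uses the average and spread of those means as estimates of the population mean and the standard error.

```python
import numpy as np

def bootstrap_mean_se(x, iters=1000, seed=None):
    """Hypothetical helper (not part of Enhancesa): bootstrap mean and SE."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Draw `iters` resamples of len(x) indices with replacement, one row each
    idx = rng.integers(0, len(x), size=(iters, len(x)))
    means = x[idx].mean(axis=1)
    # The average of the resample means estimates mu; their spread estimates SE
    return means.mean(), means.std(ddof=1)

x = np.random.default_rng(0).normal(size=100)
est_mean, est_se = bootstrap_mean_se(x, iters=1000, seed=1)
print(f"Estimated mean: {est_mean:.6f}")
print(f"Estimated SE: {est_se:.6f}")
```

With only one iteration there is a single resample mean and no spread to measure, which is why a large `iters` value is the usual choice.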
Diagnostic plots for an OLS model
enhancesa.diag_plots.diag_plots(model, y)
Produce the four R-style OLS diagnostics plots.
Parameters:
- model (Statsmodels.api.ols object) – A fitted Statsmodels ols model.
- y (numpy array, pandas series/dataframe) – The response/target variable of the model.
Returns: A 2-by-2 figure containing four diagnostics plots.
Return type: matplotlib.pyplot figure
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import enhancesa
>>> from statsmodels.formula.api import ols
>>> # Generate data with numpy
>>> x = np.random.uniform(size=100)
>>> y = 2 + 0.5*x + np.random.normal(size=100)
>>> # Put into a pandas df because of Statsmodels requirement
>>> df = pd.DataFrame(data={'x': x, 'y': y})
>>> # Create the ols model from statsmodels.formula.api
>>> model = ols('y ~ x', data=df).fit()
>>> # Create the plots
>>> enhancesa.diag_plots(model, y)
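For readers unfamiliar with R's default OLS diagnostics, the four panels are Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. The sketch below computes the quantities behind each panel with NumPy and the standard library only. It is an illustration under assumptions, not how `diag_plots` is implemented: the `panels` dictionary and all variable names are hypothetical, and the Q-Q plotting positions (i - 0.5)/n are one common convention.

```python
from statistics import NormalDist
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = 2 + 0.5 * x + rng.normal(size=100)

# OLS fit with an intercept column
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
resid = y - fitted

# Leverage is the diagonal of the hat matrix H = A @ pinv(A)
n, p = A.shape
leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))

# Internally standardized residuals
sigma2 = resid @ resid / (n - p)
std_resid = resid / np.sqrt(sigma2 * (1 - leverage))

# Theoretical normal quantiles at plotting positions (i - 0.5) / n
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_q = np.array([NormalDist().inv_cdf(q) for q in probs])

# (x-axis, y-axis) data for each of the four panels
panels = {
    "Residuals vs Fitted": (fitted, resid),
    "Normal Q-Q": (theoretical_q, np.sort(std_resid)),
    "Scale-Location": (fitted, np.sqrt(np.abs(std_resid))),
    "Residuals vs Leverage": (leverage, std_resid),
}
```

Each pair in `panels` can be passed to a scatter plot on one axis of a 2-by-2 matplotlib figure.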
Subset selection
class enhancesa.SubsetSelect(method='best')
Bases: object
Goes through all features and finds the ones that are best predictors of a response \(y\).
Parameters: method (str, default='best') – Subset selection method. Currently implemented subset selection methods are best, forward stepwise, and backward stepwise.
Methods
fit(self, X, y) | Fits a subset selection method to the data. |
fit(self, X, y)
Fits a subset selection method to the data.
Parameters:
- X (a multidimensional array or dataframe object) – The predictor variables.
- y (an array or Series object) – The target or response variable.
Returns: A dataframe with the best models selected by the given method parameter and their corresponding residual sum of squares (RSS).
Return type: DataFrame object
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from enhancesa.subset_selection import SubsetSelect
>>> from sklearn.preprocessing import PolynomialFeatures
>>> # Generate data
>>> X = np.random.normal(size=100)
>>> y = 0.5 + 2*X - 5*(X**2) + 3*(X**3) + np.random.normal(size=100)
>>> # Expand X into polynomial features
>>> poly = PolynomialFeatures(degree=10, include_bias=False)
>>> X_arr = poly.fit_transform(X[:, np.newaxis])
>>> # Put them in a dataframe, because SubsetSelect accepts only dataframes (for now)
>>> col_names = ['Y'] + ['X' + str(i) for i in range(1, 11)]
>>> df = pd.DataFrame(np.concatenate((y[:, np.newaxis], X_arr), axis=1), columns=col_names)
>>> subsets = SubsetSelect(method='best').fit(df.iloc[:,1:], df.iloc[:,0])
100%|██████████| 10/10 [00:05<00:00, 1.97it/s]
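The difference between best subset and forward stepwise selection can be sketched in a few lines of NumPy. This is not the code behind `SubsetSelect`: the functions `rss`, `best_subset`, and `forward_stepwise` are hypothetical names used only for illustration, and the sketch ranks models purely by residual sum of squares (RSS) on the training data.

```python
from itertools import combinations
import numpy as np

def rss(X, y, cols):
    """RSS of an OLS fit (with intercept) using only the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def best_subset(X, y):
    """For each size k, exhaustively pick the k columns with the lowest RSS."""
    p = X.shape[1]
    return {k: min(combinations(range(p), k), key=lambda c: rss(X, y, c))
            for k in range(1, p + 1)}

def forward_stepwise(X, y):
    """Greedily add the single column that lowers RSS the most at each step."""
    remaining, chosen, path = set(range(X.shape[1])), [], {}
    while remaining:
        best = min(remaining, key=lambda j: rss(X, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
        path[len(chosen)] = tuple(sorted(chosen))
    return path

# Synthetic data: only columns 0 and 2 actually drive the response
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 1 + 3 * X[:, 0] - 2 * X[:, 2] + 0.5 * rng.normal(size=200)

best = best_subset(X, y)
forward = forward_stepwise(X, y)
print(best[2], forward[2])
```

Best subset fits every combination of columns, so its cost grows exponentially with the number of features, while forward stepwise fits far fewer models but may miss the optimal subset of a given size. Backward stepwise works the same way in reverse, starting from all features and dropping one at a time.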
License
The MIT License (MIT)
Copyright (c) 2019 Ali Sina
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Installation
Enhancesa can be installed from the PyPI package repository.
$ pip install enhancesa
Alternatively, you can download the source from GitHub.
Quick glimpse
>>> import numpy as np
>>> import enhancesa as esa
>>> # Create some dummy data
>>> x = np.random.normal(size=100)
>>> # Compute test statistics with bootstrap resampling
>>> esa.bootstrap(x, iters=1000)
Estimated mean: -0.025309
Estimated SE: 0.095531
dtype: float64
Upcoming features
- Partial least squares (PLS) regression
- Principal components regression (PCR)
- Subset selection plots
- Additional test statistics in bootstrap resampling