Download a PDF version of the documentation: pyhdfe.pdf
PyHDFE¶
The documentation can be navigated with the sidebar, the links below, or the index.
Introduction¶
Note
This package is in beta. In future versions, the API may change substantially. Please use the GitHub issue tracker to report bugs or to request features.
PyHDFE is a Python 3 implementation of algorithms for absorbing high dimensional fixed effects. This package was created by Jeff Gortmaker in collaboration with Anya Tarascina.
What PyHDFE won’t do is provide a convenient interface for running regressions. Instead, the package is meant to be incorporated into statistical projects that would benefit from performant fixed effect absorption. Another goal is facilitating fair comparison of algorithms that have been previously implemented in various languages with different convergence criteria.
Development of the package has been guided by code made publicly available by many researchers and practitioners. For a full list of papers and software cited in this documentation, refer to the references section of the documentation.
Installation¶
The PyHDFE package has been tested on Python versions 3.6 through 3.9. The SciPy instructions for installing related packages is a good guide for how to install a scientific Python environment. A good choice is the Anaconda Distribution, since, along with many other packages that are useful for scientific computing, it comes packaged with PyHDFE’s only required dependencies: NumPy and SciPy.
You can install the current release of PyHDFE with pip:
pip install pyhdfe
You can upgrade to a newer release with the --upgrade
flag:
pip install --upgrade pyhdfe
If you lack permissions, you can install PyHDFE in your user directory with the --user
flag:
pip install --user pyhdfe
Alternatively, you can download a wheel or source archive from PyPI. You can find the latest development code on GitHub and the latest development documentation here.
Bugs and Requests¶
Please use the GitHub issue tracker to submit bugs or to request features.
API Documentation¶
Algorithms for absorbing fixed effects should be created with the following function.
|
Initialize an algorithm for absorbing fixed effects. |
Algorithm classes contain information about the fixed effects.
Algorithm for absorbing fixed effects. |
They can be used to absorb fixed effects (i.e., residualize matrices).
|
Absorb the fixed effects into a matrix and return the residuals from a regression of each column of the matrix on the fixed effects. |
Tutorial¶
This section uses a series of Jupyter Notebooks to demonstrate how pyhdfe can be used together with regression routines from other packages. Each notebook employs the Frisch-Waugh-Lovell (FWL) theorem of Frisch and Waugh (1933) and Lovell (1963) to run a fixed effects regression by residualizing (projecting) the variables of interest.
This tutorial is just meant to demonstrate how pyhdfe can be used in the simplest of applications. For detailed information about the different algorithms supported by pyhdfe, refer to API Documentation.
Download the Jupyter Notebook for this section: sklearn.ipynb
scikit-learn¶
[1]:
import pyhdfe
import numpy as np
from sklearn import datasets, linear_model
pyhdfe.__version__
[1]:
'0.2.0'
In this tutorial, we’ll use the boston data set from scikit-learn to demonstrate how pyhdfe can be used to absorb fixed effects before running regressions.
First, load the data set and create a matrix of fixed effect IDs. We’ll use a dummy for the Charles river and an index of accessibility to radial highways.
[2]:
boston = datasets.load_boston().data
ids = boston[:, [3, 8]]
ids
C:\Programs\Anaconda\envs\pyhdfe\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.
The Boston housing prices dataset has an ethical problem. You can refer to
the documentation of this function for further details.
The scikit-learn maintainers therefore strongly discourage the use of this
dataset unless the purpose of the code is to study and educate about
ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original
source::
import pandas as pd
import numpy as np
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
Alternative datasets include the California housing dataset (i.e.
:func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
dataset. You can load the datasets as follows::
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
for the California housing dataset and::
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
for the Ames housing dataset.
warnings.warn(msg, category=FutureWarning)
[2]:
array([[0., 1.],
[0., 2.],
[0., 2.],
...,
[0., 1.],
[0., 1.],
[0., 1.]])
Next, choose our variables: per capita crime rate, proportion of residential land zoned for lots over 25,000 square feet, and proportion of non-retail business acres per town.
[3]:
variables = boston[:, :3]
variables
[3]:
array([[6.3200e-03, 1.8000e+01, 2.3100e+00],
[2.7310e-02, 0.0000e+00, 7.0700e+00],
[2.7290e-02, 0.0000e+00, 7.0700e+00],
...,
[6.0760e-02, 0.0000e+00, 1.1930e+01],
[1.0959e-01, 0.0000e+00, 1.1930e+01],
[4.7410e-02, 0.0000e+00, 1.1930e+01]])
The create function initializes an Algorithm for fixed effect absorption that can residualize matrices with Algorithm.residualize. We’ll use the default algorithm. You may want to try other algorithms if it takes a long time to absorb fixed effects into your data.
[4]:
algorithm = pyhdfe.create(ids)
residualized = algorithm.residualize(variables)
residualized
[4]:
array([[-1.08723516e-01, -2.20167195e+01, -2.65583593e+00],
[-5.59754167e-02, -2.04166667e+01, -2.56083333e+00],
[-5.59954167e-02, -2.04166667e+01, -2.56083333e+00],
...,
[-5.42835164e-02, -4.00167195e+01, 6.96416407e+00],
[-5.45351644e-03, -4.00167195e+01, 6.96416407e+00],
[-6.76335164e-02, -4.00167195e+01, 6.96416407e+00]])
We can now run a regression of per capita crime rate on the other two variables and our fixed effects.
[5]:
y = residualized[:, [0]]
X = residualized[:, 1:]
regression = linear_model.LinearRegression()
regression.fit(X, y)
regression.coef_
[5]:
array([[-6.97058632e-05, 5.53038164e-02]])
Download the Jupyter Notebook for this section: statsmodels.ipynb
statsmodels¶
[1]:
import pyhdfe
import numpy as np
import statsmodels.api as sm
from sklearn import datasets
pyhdfe.__version__
[1]:
'0.2.0'
In this tutorial, we’ll use the boston data set from scikit-learn to demonstrate how pyhdfe can be used to absorb fixed effects before running regressions with statsmodels. We’ll also demonstrate how pyhdfe can be used to compute degrees of freedom used by fixed effects.
First, load the data set and create a matrix of fixed effect IDs. We’ll use a dummy for the Charles river and an index of accessibility to radial highways.
[2]:
boston = datasets.load_boston().data
ids = boston[:, [3, 8]]
ids
C:\Programs\Anaconda\envs\pyhdfe\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.
The Boston housing prices dataset has an ethical problem. You can refer to
the documentation of this function for further details.
The scikit-learn maintainers therefore strongly discourage the use of this
dataset unless the purpose of the code is to study and educate about
ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original
source::
import pandas as pd
import numpy as np
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
Alternative datasets include the California housing dataset (i.e.
:func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
dataset. You can load the datasets as follows::
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
for the California housing dataset and::
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
for the Ames housing dataset.
warnings.warn(msg, category=FutureWarning)
[2]:
array([[0., 1.],
[0., 2.],
[0., 2.],
...,
[0., 1.],
[0., 1.],
[0., 1.]])
Next, choose our variables: per capita crime rate, proportion of residential land zoned for lots over 25,000 square feet, and proportion of non-retail business acres per town.
[3]:
variables = boston[:, :3]
variables
[3]:
array([[6.3200e-03, 1.8000e+01, 2.3100e+00],
[2.7310e-02, 0.0000e+00, 7.0700e+00],
[2.7290e-02, 0.0000e+00, 7.0700e+00],
...,
[6.0760e-02, 0.0000e+00, 1.1930e+01],
[1.0959e-01, 0.0000e+00, 1.1930e+01],
[4.7410e-02, 0.0000e+00, 1.1930e+01]])
The create function initializes an Algorithm for fixed effect absorption that can residualize matrices with Algorithm.residualize. We’ll use the default algorithm. You may want to try other algorithms if it takes a long time to absorb fixed effects into your data.
[4]:
algorithm = pyhdfe.create(ids)
residualized = algorithm.residualize(variables)
residualized
[4]:
array([[-1.08723516e-01, -2.20167195e+01, -2.65583593e+00],
[-5.59754167e-02, -2.04166667e+01, -2.56083333e+00],
[-5.59954167e-02, -2.04166667e+01, -2.56083333e+00],
...,
[-5.42835164e-02, -4.00167195e+01, 6.96416407e+00],
[-5.45351644e-03, -4.00167195e+01, 6.96416407e+00],
[-6.76335164e-02, -4.00167195e+01, 6.96416407e+00]])
We can now run a regression of per capita crime rate on the other two variables and our fixed effects.
[5]:
y = residualized[:, [0]]
X = residualized[:, 1:]
ols = sm.OLS(y, X)
result = ols.fit()
result.params
[5]:
array([-6.97058632e-05, 5.53038164e-02])
Standard errors can be adjusted to account for the degrees of freedom that are lost because of the fixed effects. By default, fixed effect degrees of freedom are computed when create initializes an algorithm and are stored in Algorithm.degrees.
[6]:
se = result.HC0_se
se
[6]:
array([0.00109298, 0.00962226])
[7]:
se_adjusted = np.sqrt(np.square(se) * result.df_resid / (result.df_resid - algorithm.degrees))
se_adjusted
[7]:
array([0.00110398, 0.00971916])
References¶
Papers¶
Abowd, Creecy, and Kramarz (2002)¶
Abowd, John M., Robert H. Creecy, and Francis Kramarz (2002). Computing person and firm effects using linked longitudinal employer-employee data. Longitudinal Employer-Household Dynamics Technical Papers 2002-06, Center for Economic Studies, U.S. Census Bureau.
Correia (2015)¶
Correia, Sergio (2015). Singletons, cluster-robust standard errors and fixed effects: A bad mix. Technical Note, Duke University.
Correia (2017)¶
Correia, Sergio (2017). Linear models with high-dimensional fixed effects: An efficient and feasible estimator. Working Paper.
Fong and Saunders (2011)¶
Fong, David Chin-Lung, and Michael Saunders (2011). LSMR: An iterative algorithm for sparse least-squares problems. SIAM Journal on Scientific Computing, 33 (5), 2950–2971.
Frisch and Waugh (1933)¶
Frisch, Ragnar, and Frederick V. Waugh (1933). Partial time regressions as compared with individual trends. Econometrica, 1 (4), 387-401.
Gaure (2013a)¶
Gaure, Simen (2013a). OLS with multiple high dimensional category variables. Computational Statistics & Data Analysis, 66 (0), 8-18.
Gaure (2013b)¶
Gaure, Simen (2013b). lfe: Linear group fixed effects. The R Journal, 5 (2), 104-117.
Gearhart and Koshy (1989)¶
Gearhart, William B., and Mathew Koshy (1989). Acceleration schemes for the method of alternating projections. Journal of Computational and Applied Mathematics, 26 (3), 235-249.
Guimarães and Portugal (2010)¶
Guimarães, Paulo, and Pedro Portugal (2010). A simple feasible procedure to fit models with high-dimensional fixed effects. Stata Journal, 10 (4), 628-649.
Hernández-Ramos, Escalante, and Raydan (2011)¶
Hernández-Ramos, Luis M., René Escalante, and Marcos Raydan (2011). Unconstrained optimization techniques for the acceleration of alternating projection methods. Numerical Functional Analysis and Optimization, 32 (10), 1041-1066.
Lovell (1963)¶
Lovell, Michael C. (1963). Seasonal adjustment of economic time series and multiple regression analysis. Journal of the American Statistical Association, 58 (304), 993-1010.
Somaini and Wolak (2016)¶
Somaini, Paulo, and Frank A. Wolak (2016). An algorithm to estimate the two-way fixed effects model. Journal of Econometric Methods, 5 (1), 143-152.
Software¶
FixedEffectModels.jl¶
FixedEffectModels.jl. Julia. Matthieu Gomez. Implements a version of Guimarães and Portugal (2010), Gaure (2013a), Gaure (2013b), and Correia (2017).
lfe¶
lfe. R. Simen Gaure. Implements Guimarães and Portugal (2010), Gaure (2013a), and Gaure (2013b).
reghdfe¶
reghdfe. Stata. Sergio Correia. Implements Correia (2017), which augments Guimarães and Portugal (2010), Gaure (2013a), and Gaure (2013b).
res2fe¶
res2fe. Matlab, SAS, and Stata. Paulo Somaini and Frank Wolak. Implements Somaini and Wolak (2016).
Legal¶
Copyright 2019 Jeff Gortmaker and Anya Tarascina
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Contributing¶
Please use the GitHub issue tracker to report bugs or to request features. Contributions are welcome. Examples include:
Code optimizations.
Documentation improvements.
Alternate algorithms that have been implemented in the literature but not in PyHDFE.
Testing¶
Testing is done with the tox automation tool, which runs a pytest-backed test suite in the tests
module.
Testing Requirements¶
In addition to the installation requirements for the package itself, running tests and building documentation requires additional packages specified by the tests
and docs
extras in setup.py
, along with any other explicitly specified deps
in tox.ini
.
Running Tests¶
Defined in tox.ini
are environments that test the package under different python versions, check types, enforce style guidelines, verify the integrity of the documentation, and release the package. The following command can be run in the top-level pyfwl directory to run all testing environments:
tox
You can choose to run only one environment, such as the one that builds the documentation, with the -e
flag:
tox -e docs
Test Organization¶
Fixtures, which are defined in tests.conftest
, configure the testing environment and load data according to a range of specifications.
Tests in tests.test_hdfe
verify that different algorithms yield the same solutions.
Version Notes¶
These notes will only include major changes.
0.2¶
Initial support for weights.
0.1¶
Initial release.