Download a PDF version of the documentation: pyhdfe.pdf

PyHDFE

The documentation can be navigated with the sidebar, the links below, or the index.

Introduction

Note

This package is in beta. In future versions, the API may change substantially. Please use the GitHub issue tracker to report bugs or to request features.

PyHDFE is a Python 3 implementation of algorithms for absorbing high dimensional fixed effects. This package was created by Jeff Gortmaker in collaboration with Anya Tarascina.

What PyHDFE won’t do is provide a convenient interface for running regressions. Instead, the package is meant to be incorporated into statistical projects that would benefit from performant fixed effect absorption. Another goal is facilitating fair comparison of algorithms that have been previously implemented in various languages with different convergence criteria.

Development of the package has been guided by code made publicly available by many researchers and practitioners. For a full list of papers and software cited in this documentation, refer to the references section of the documentation.

Installation

The PyHDFE package has been tested on Python versions 3.6 through 3.9. The SciPy instructions for installing related packages is a good guide for how to install a scientific Python environment. A good choice is the Anaconda Distribution, since, along with many other packages that are useful for scientific computing, it comes packaged with PyHDFE’s only required dependencies: NumPy and SciPy.

You can install the current release of PyHDFE with pip:

pip install pyhdfe

You can upgrade to a newer release with the --upgrade flag:

pip install --upgrade pyhdfe

If you lack permissions, you can install PyHDFE in your user directory with the --user flag:

pip install --user pyhdfe

Alternatively, you can download a wheel or source archive from PyPI. You can find the latest development code on GitHub and the latest development documentation here.

Bugs and Requests

Please use the GitHub issue tracker to submit bugs or to request features.

API Documentation

Algorithms for absorbing fixed effects should be created with the following function.

create(ids[, cluster_ids, drop_singletons, …])

Initialize an algorithm for absorbing fixed effects.

Algorithm classes contain information about the fixed effects.

Algorithm

Algorithm for absorbing fixed effects.

They can be used to absorb fixed effects (i.e., residualize matrices).

Algorithm.residualize(matrix[, weights, errors])

Absorb the fixed effects into a matrix and return the residuals from a regression of each column of the matrix on the fixed effects.

Tutorial

This section uses a series of Jupyter Notebooks to demonstrate how pyhdfe can be used together with regression routines from other packages. Each notebook employs the Frisch-Waugh-Lovell (FWL) theorem of Frisch and Waugh (1933) and Lovell (1963) to run a fixed effects regression by residualizing (projecting) the variables of interest.

This tutorial is just meant to demonstrate how pyhdfe can be used in the simplest of applications. For detailed information about the different algorithms supported by pyhdfe, refer to API Documentation.

Download the Jupyter Notebook for this section: sklearn.ipynb

scikit-learn

[1]:
import pyhdfe
import numpy as np
from sklearn import datasets, linear_model

pyhdfe.__version__
[1]:
'0.2.0'

In this tutorial, we’ll use the boston data set from scikit-learn to demonstrate how pyhdfe can be used to absorb fixed effects before running regressions.

First, load the data set and create a matrix of fixed effect IDs. We’ll use a dummy for the Charles river and an index of accessibility to radial highways.

[2]:
boston = datasets.load_boston().data
ids = boston[:, [3, 8]]
ids
C:\Programs\Anaconda\envs\pyhdfe\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_housing
        housing = fetch_california_housing()

    for the California housing dataset and::

        from sklearn.datasets import fetch_openml
        housing = fetch_openml(name="house_prices", as_frame=True)

    for the Ames housing dataset.
  warnings.warn(msg, category=FutureWarning)
[2]:
array([[0., 1.],
       [0., 2.],
       [0., 2.],
       ...,
       [0., 1.],
       [0., 1.],
       [0., 1.]])

Next, choose our variables: per capita crime rate, proportion of residential land zoned for lots over 25,000 square feet, and proportion of non-retail business acres per town.

[3]:
variables = boston[:, :3]
variables
[3]:
array([[6.3200e-03, 1.8000e+01, 2.3100e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01],
       [1.0959e-01, 0.0000e+00, 1.1930e+01],
       [4.7410e-02, 0.0000e+00, 1.1930e+01]])

The create function initializes an Algorithm for fixed effect absorption that can residualize matrices with Algorithm.residualize. We’ll use the default algorithm. You may want to try other algorithms if it takes a long time to absorb fixed effects into your data.

[4]:
algorithm = pyhdfe.create(ids)
residualized = algorithm.residualize(variables)
residualized
[4]:
array([[-1.08723516e-01, -2.20167195e+01, -2.65583593e+00],
       [-5.59754167e-02, -2.04166667e+01, -2.56083333e+00],
       [-5.59954167e-02, -2.04166667e+01, -2.56083333e+00],
       ...,
       [-5.42835164e-02, -4.00167195e+01,  6.96416407e+00],
       [-5.45351644e-03, -4.00167195e+01,  6.96416407e+00],
       [-6.76335164e-02, -4.00167195e+01,  6.96416407e+00]])

We can now run a regression of per capita crime rate on the other two variables and our fixed effects.

[5]:
y = residualized[:, [0]]
X = residualized[:, 1:]
regression = linear_model.LinearRegression()
regression.fit(X, y)
regression.coef_
[5]:
array([[-6.97058632e-05,  5.53038164e-02]])

Download the Jupyter Notebook for this section: statsmodels.ipynb

statsmodels

[1]:
import pyhdfe
import numpy as np
import statsmodels.api as sm
from sklearn import datasets

pyhdfe.__version__
[1]:
'0.2.0'

In this tutorial, we’ll use the boston data set from scikit-learn to demonstrate how pyhdfe can be used to absorb fixed effects before running regressions with statsmodels. We’ll also demonstrate how pyhdfe can be used to compute degrees of freedom used by fixed effects.

First, load the data set and create a matrix of fixed effect IDs. We’ll use a dummy for the Charles river and an index of accessibility to radial highways.

[2]:
boston = datasets.load_boston().data
ids = boston[:, [3, 8]]
ids
C:\Programs\Anaconda\envs\pyhdfe\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_housing
        housing = fetch_california_housing()

    for the California housing dataset and::

        from sklearn.datasets import fetch_openml
        housing = fetch_openml(name="house_prices", as_frame=True)

    for the Ames housing dataset.
  warnings.warn(msg, category=FutureWarning)
[2]:
array([[0., 1.],
       [0., 2.],
       [0., 2.],
       ...,
       [0., 1.],
       [0., 1.],
       [0., 1.]])

Next, choose our variables: per capita crime rate, proportion of residential land zoned for lots over 25,000 square feet, and proportion of non-retail business acres per town.

[3]:
variables = boston[:, :3]
variables
[3]:
array([[6.3200e-03, 1.8000e+01, 2.3100e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01],
       [1.0959e-01, 0.0000e+00, 1.1930e+01],
       [4.7410e-02, 0.0000e+00, 1.1930e+01]])

The create function initializes an Algorithm for fixed effect absorption that can residualize matrices with Algorithm.residualize. We’ll use the default algorithm. You may want to try other algorithms if it takes a long time to absorb fixed effects into your data.

[4]:
algorithm = pyhdfe.create(ids)
residualized = algorithm.residualize(variables)
residualized
[4]:
array([[-1.08723516e-01, -2.20167195e+01, -2.65583593e+00],
       [-5.59754167e-02, -2.04166667e+01, -2.56083333e+00],
       [-5.59954167e-02, -2.04166667e+01, -2.56083333e+00],
       ...,
       [-5.42835164e-02, -4.00167195e+01,  6.96416407e+00],
       [-5.45351644e-03, -4.00167195e+01,  6.96416407e+00],
       [-6.76335164e-02, -4.00167195e+01,  6.96416407e+00]])

We can now run a regression of per capita crime rate on the other two variables and our fixed effects.

[5]:
y = residualized[:, [0]]
X = residualized[:, 1:]
ols = sm.OLS(y, X)
result = ols.fit()
result.params
[5]:
array([-6.97058632e-05,  5.53038164e-02])

Standard errors can be adjusted to account for the degrees of freedom that are lost because of the fixed effects. By default, fixed effect degrees of freedom are computed when create initializes an algorithm and are stored in Algorithm.degrees.

[6]:
se = result.HC0_se
se
[6]:
array([0.00109298, 0.00962226])
[7]:
se_adjusted = np.sqrt(np.square(se) * result.df_resid / (result.df_resid - algorithm.degrees))
se_adjusted
[7]:
array([0.00110398, 0.00971916])

References

Papers

Abowd, Creecy, and Kramarz (2002)

Abowd, John M., Robert H. Creecy, and Francis Kramarz (2002). Computing person and firm effects using linked longitudinal employer-employee data. Longitudinal Employer-Household Dynamics Technical Papers 2002-06, Center for Economic Studies, U.S. Census Bureau.

Correia (2015)

Correia, Sergio (2015). Singletons, cluster-robust standard errors and fixed effects: A bad mix. Technical Note, Duke University.

Correia (2017)

Correia, Sergio (2017). Linear models with high-dimensional fixed effects: An efficient and feasible estimator. Working Paper.

Fong and Saunders (2011)

Fong, David Chin-Lung, and Michael Saunders (2011). LSMR: An iterative algorithm for sparse least-squares problems. SIAM Journal on Scientific Computing, 33 (5), 2950–2971.

Frisch and Waugh (1933)

Frisch, Ragnar, and Frederick V. Waugh (1933). Partial time regressions as compared with individual trends. Econometrica, 1 (4), 387-401.

Gaure (2013a)

Gaure, Simen (2013a). OLS with multiple high dimensional category variables. Computational Statistics & Data Analysis, 66 (0), 8-18.

Gaure (2013b)

Gaure, Simen (2013b). lfe: Linear group fixed effects. The R Journal, 5 (2), 104-117.

Gearhart and Koshy (1989)

Gearhart, William B., and Mathew Koshy (1989). Acceleration schemes for the method of alternating projections. Journal of Computational and Applied Mathematics, 26 (3), 235-249.

Guimarães and Portugal (2010)

Guimarães, Paulo, and Pedro Portugal (2010). A simple feasible procedure to fit models with high-dimensional fixed effects. Stata Journal, 10 (4), 628-649.

Hernández-Ramos, Escalante, and Raydan (2011)

Hernández-Ramos, Luis M., René Escalante, and Marcos Raydan (2011). Unconstrained optimization techniques for the acceleration of alternating projection methods. Numerical Functional Analysis and Optimization, 32 (10), 1041-1066.

Lovell (1963)

Lovell, Michael C. (1963). Seasonal adjustment of economic time series and multiple regression analysis. Journal of the American Statistical Association, 58 (304), 993-1010.

Somaini and Wolak (2016)

Somaini, Paulo, and Frank A. Wolak (2016). An algorithm to estimate the two-way fixed effects model. Journal of Econometric Methods, 5 (1), 143-152.

Software

FixedEffectModels.jl

FixedEffectModels.jl. Julia. Matthieu Gomez. Implements a version of Guimarães and Portugal (2010), Gaure (2013a), Gaure (2013b), and Correia (2017).

lfe

lfe. R. Simen Gaure. Implements Guimarães and Portugal (2010), Gaure (2013a), and Gaure (2013b).

reghdfe

reghdfe. Stata. Sergio Correia. Implements Correia (2017), which augments Guimarães and Portugal (2010), Gaure (2013a), and Gaure (2013b).

res2fe

res2fe. Matlab, SAS, and Stata. Paulo Somaini and Frank Wolak. Implements Somaini and Wolak (2016).

Contributing

Please use the GitHub issue tracker to report bugs or to request features. Contributions are welcome. Examples include:

  • Code optimizations.

  • Documentation improvements.

  • Alternate algorithms that have been implemented in the literature but not in PyHDFE.

Testing

Testing is done with the tox automation tool, which runs a pytest-backed test suite in the tests module.

Testing Requirements

In addition to the installation requirements for the package itself, running tests and building documentation requires additional packages specified by the tests and docs extras in setup.py, along with any other explicitly specified deps in tox.ini.

Running Tests

Defined in tox.ini are environments that test the package under different python versions, check types, enforce style guidelines, verify the integrity of the documentation, and release the package. The following command can be run in the top-level pyfwl directory to run all testing environments:

tox

You can choose to run only one environment, such as the one that builds the documentation, with the -e flag:

tox -e docs

Test Organization

Fixtures, which are defined in tests.conftest, configure the testing environment and load data according to a range of specifications.

Tests in tests.test_hdfe verify that different algorithms yield the same solutions.

Version Notes

These notes will only include major changes.

0.2

  • Initial support for weights.

0.1

  • Initial release.