pyuplift documentation

pyuplift is a scientific uplift modeling library. It implements variable selection and transformation approaches and provides an API for working with uplift datasets such as Hillstrom Email Marketing and Criteo Uplift Prediction.

Contents

Installation Guide

Install from PyPI

pip install pyuplift

Install from source code

python setup.py install

Examples of Usage

This section contains official usage examples for the pyuplift package.

Contribute to pyuplift

Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.

Guidelines

Submit Pull Request

  • Before submitting, please rebase your code on the most recent version of master. You can do it by

    git remote add upstream https://github.com/duketemon/pyuplift
    git fetch upstream
    git rebase upstream/master
    
  • If you have multiple small commits, it might be good to merge them together (use git rebase and then squash) into more meaningful groups.

  • Send the pull request!

    • Fix the problems reported by automatic checks
    • If you are contributing a new module, consider adding a testcase

Git Workflow Howtos

How to resolve a conflict with master
  • First, rebase to the most recent master

    # The first two steps can be skipped after you do it once.
    git remote add upstream https://github.com/duketemon/pyuplift
    git fetch upstream
    git rebase upstream/master
    
  • Git may show some conflicts it cannot merge, say conflicted.py.

    • Manually modify the file to resolve the conflict.

    • After you have resolved the conflict, mark it as resolved by

      git add conflicted.py
      
  • Then you can continue the rebase by

    git rebase --continue
    
  • Finally, push to your fork. You may need to force push here.

    git push --force
    
How to combine multiple commits into one

Sometimes we want to combine multiple commits, especially when later commits are only fixes to previous ones, to create a PR with a set of meaningful commits. You can do it by following these steps.

  • Before doing so, configure the default editor of git if you haven’t done so before.

    git config core.editor the-editor-you-like
    
  • Assume we want to merge the last 3 commits; type the following commands

    git rebase -i HEAD~3
    
  • It will pop up a text editor. Set the first commit as pick, and change the later ones to squash.

  • After you save the file, another text editor will pop up asking you to modify the combined commit message.

  • Push the changes to your fork. You need to force push.

    git push --force
    
What is the consequence of force push

The previous two tips require a force push because we altered the history of the commits. It is fine to force push to your own fork, as long as the changed commits are only yours.

Documents

  • Documentation is built using sphinx.
  • Each document is written in reStructuredText.
  • You can build the documentation locally to see the effect.

Base Model

The base class for all uplift estimators.

Note

This class should not be used directly. Use derived classes instead.
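The contract that derived classes follow can be pictured as a minimal sketch. The class below is hypothetical, not pyuplift's actual source:

```python
class BaseModel:
    """Illustrative sketch of the shared uplift-estimator interface.

    Derived classes implement fit(X, y, t) and predict(X, t=None);
    the base class only defines the contract.
    """

    def fit(self, X, y, t):
        # X: [n_samples, n_features], y: [n_samples], t: [n_samples]
        raise NotImplementedError("Use a derived class instead.")

    def predict(self, X, t=None):
        # Returns an array of predicted uplift values for X.
        raise NotImplementedError("Use a derived class instead.")
```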

Variable Selection

The pyuplift.variable_selection module includes classes which belong to the variable selection group of approaches.

Two Model

The class which implements the two-model approach [1].

Parameters
no_treatment_model : object, optional (default=sklearn.linear_model.LinearRegression)
The regression model used to predict outcomes of the control (no-treatment) group.
has_treatment_model : object, optional (default=sklearn.linear_model.LinearRegression)
The regression model used to predict outcomes of the treatment group.
Methods
fit(self, X, y, t) Build a two-model estimator from the training set (X, y, t).
predict(self, X, t=None) Predict an uplift for X.
fit(self, X, y, t)

Build a two-model estimator from the training set (X, y, t).

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
Returns self : object
predict(self, X, t=None)

Predict an uplift for X.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
t: numpy array with shape = [n_samples,] or None
Array of treatments.
Returns
uplift: numpy array with shape = [n_samples,]
The predicted uplift values.
References
  1. A Literature Survey and Experimental Evaluation of the State-of-the-Art in Uplift Modeling: A Stepping Stone Toward the Development of Prescriptive Analytics by Floris Devriendt, Darie Moldovan, and Wouter Verbeke
from pyuplift.variable_selection import TwoModel
...
model = TwoModel()
model.fit(X[train_indexes, :], y[train_indexes], t[train_indexes])
uplift = model.predict(X[test_indexes, :])
print(uplift)
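The idea behind the two-model approach can be sketched with a numpy-only linear model: fit one regressor on treated rows and another on control rows, then predict uplift as the difference of their predictions. The function below is an illustration of the technique, not pyuplift's implementation:

```python
import numpy as np

def two_model_uplift(X, y, t, X_new):
    """Two-model sketch: separate linear fits for treated (t == 1)
    and control (t == 0) rows; uplift = treated prediction minus
    control prediction. Illustrative only."""
    Xb = np.c_[np.ones(len(X)), X]  # add intercept column
    w1, *_ = np.linalg.lstsq(Xb[t == 1], y[t == 1], rcond=None)
    w0, *_ = np.linalg.lstsq(Xb[t == 0], y[t == 0], rcond=None)
    Xn = np.c_[np.ones(len(X_new)), X_new]
    return Xn @ w1 - Xn @ w0
```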

Econometric

The class which implements the econometric approach [1].

Parameters
model : object, optional (default=sklearn.linear_model.LinearRegression)
The regression model used to predict the uplift.
Methods
fit(self, X, y, t) Build an econometric model from the training set (X, y, t).
predict(self, X, t=None) Predict an uplift for X.
fit(self, X, y, t)

Build an econometric model from the training set (X, y, t).

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
Returns self : object
predict(self, X, t=None)

Predict an uplift for X.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
t: numpy array with shape = [n_samples,] or None
Array of treatments.
Returns
uplift: numpy array with shape = [n_samples,]
The predicted uplift values.
References
  1. A Literature Survey and Experimental Evaluation of the State-of-the-Art in Uplift Modeling: A Stepping Stone Toward the Development of Prescriptive Analytics by Floris Devriendt, Darie Moldovan, and Wouter Verbeke
from pyuplift.variable_selection import Econometric
...
model = Econometric()
model.fit(X[train_indexes, :], y[train_indexes], t[train_indexes])
uplift = model.predict(X[test_indexes, :])
print(uplift)
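The econometric approach can be pictured as a single regression on the features, the treatment indicator, and feature-treatment interactions; uplift is then the predicted outcome difference between t=1 and t=0. A numpy-only sketch of that idea (illustrative, not the library's code):

```python
import numpy as np

def econometric_uplift(X, y, t, X_new):
    """Econometric sketch: one linear fit on [1, X, T, X*T];
    uplift = prediction with T=1 minus prediction with T=0."""
    def design(F, trt):
        trt = np.asarray(trt, dtype=float).reshape(-1, 1)
        return np.c_[np.ones(len(F)), F, trt, F * trt]

    w, *_ = np.linalg.lstsq(design(X, t), y, rcond=None)
    n = len(X_new)
    return design(X_new, np.ones(n)) @ w - design(X_new, np.zeros(n)) @ w
```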

Dummy

The class which implements the dummy approach [1].

Parameters
model : object, optional (default=sklearn.linear_model.LinearRegression)
The regression model used to predict the uplift.
Methods
fit(self, X, y, t) Build a dummy model from the training set (X, y, t).
predict(self, X, t=None) Predict an uplift for X.
fit(self, X, y, t)

Build a dummy model from the training set (X, y, t).

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
Returns self : object
predict(self, X, t=None)

Predict an uplift for X.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
t: numpy array with shape = [n_samples,] or None
Array of treatments.
Returns
uplift: numpy array with shape = [n_samples,]
The predicted uplift values.
References
  1. A Literature Survey and Experimental Evaluation of the State-of-the-Art in Uplift Modeling: A Stepping Stone Toward the Development of Prescriptive Analytics by Floris Devriendt, Darie Moldovan, and Wouter Verbeke
from pyuplift.variable_selection import Dummy
...
model = Dummy()
model.fit(X[train_indexes, :], y[train_indexes], t[train_indexes])
uplift = model.predict(X[test_indexes, :])
print(uplift)

Cadit

The class which implements the cadit approach [1].

Parameters
model : object, optional (default=sklearn.linear_model.LinearRegression)
The regression model used to predict the uplift.
Methods
fit(self, X, y, t) Build a model from the training set (X, y, t).
predict(self, X, t=None) Predict an uplift for X.
fit(self, X, y, t)

Build a model from the training set (X, y, t).

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
Returns self : object
predict(self, X, t=None)

Predict an uplift for X.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
t: numpy array with shape = [n_samples,] or None
Array of treatments.
Returns
uplift: numpy array with shape = [n_samples,]
The predicted uplift values.
References
  1. Weisberg HI, Pontes VP. Post hoc subgroups in clinical trials: Anathema or analytics? Clinical Trials. 2015 Aug;12(4):357-64.
from pyuplift.variable_selection import Cadit
...
model = Cadit()
model.fit(X[train_indexes, :], y[train_indexes], t[train_indexes])
uplift = model.predict(X[test_indexes, :])
print(uplift)
variable_selection.TwoModel([no_treatment_model, has_treatment_model]) A two model approach.
variable_selection.Econometric([model]) An econometric approach.
variable_selection.Dummy([model]) A dummy approach.
variable_selection.Cadit([model]) A cadit approach.

Transformation

The pyuplift.transformation module includes classes which belong to the transformation group of approaches.

Transformation Base Model

The base class for all transformation uplift estimators.

Note

This class should not be used directly. Use derived classes instead.

Lai

The class which implements Lai's approach [1].

Parameters
model : object, optional (default=sklearn.linear_model.LogisticRegression)
The classification model used to predict the uplift.
use_weights : boolean, optional (default=False)
Whether to use weights.
Methods
fit(self, X, y, t) Build the model from the training set (X, y, t).
predict(self, X, t=None) Predict an uplift for X.
fit(self, X, y, t)

Build the model from the training set (X, y, t).

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
Returns self : object
predict(self, X, t=None)

Predict an uplift for X.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
t: numpy array with shape = [n_samples,] or None
Array of treatments.
Returns
uplift: numpy array with shape = [n_samples,]
The predicted uplift values.
References
  1. A Literature Survey and Experimental Evaluation of the State-of-the-Art in Uplift Modeling: A Stepping Stone Toward the Development of Prescriptive Analytics by Floris Devriendt, Darie Moldovan, and Wouter Verbeke
from pyuplift.transformation import Lai
...
model = Lai()
model.fit(X[train_indexes, :], y[train_indexes], t[train_indexes])
uplift = model.predict(X[test_indexes, :])
print(uplift)
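A common reading of Lai-style class transformation is to relabel samples: z = 1 for treated responders and control non-responders, then train a classifier on z and derive uplift from P(z=1|x) (often as 2*P(z=1|x) - 1). The transformation step can be sketched as follows; this is an illustration of the general technique, not pyuplift's code:

```python
import numpy as np

def lai_class_labels(y, t):
    """Class-variable transformation sketch: z = 1 for treated
    responders (t == 1, y == 1) and control non-responders
    (t == 0, y == 0); z = 0 otherwise."""
    z = ((t == 1) & (y == 1)) | ((t == 0) & (y == 0))
    return z.astype(int)
```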

Kane

The class which implements Kane's approach [1].

Parameters
model : object, optional (default=sklearn.linear_model.LogisticRegression)
The classification model used to predict the uplift.
use_weights : boolean, optional (default=False)
Whether to use weights.
Methods
fit(self, X, y, t) Build the model from the training set (X, y, t).
predict(self, X, t=None) Predict an uplift for X.
fit(self, X, y, t)

Build the model from the training set (X, y, t).

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
Returns self : object
predict(self, X, t=None)

Predict an uplift for X.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
t: numpy array with shape = [n_samples,] or None
Array of treatments.
Returns
uplift: numpy array with shape = [n_samples,]
The predicted uplift values.
References
  1. A Literature Survey and Experimental Evaluation of the State-of-the-Art in Uplift Modeling: A Stepping Stone Toward the Development of Prescriptive Analytics by Floris Devriendt, Darie Moldovan, and Wouter Verbeke
from pyuplift.transformation import Kane
...
model = Kane()
model.fit(X[train_indexes, :], y[train_indexes], t[train_indexes])
uplift = model.predict(X[test_indexes, :])
print(uplift)

Jaskowski

The class which implements Jaskowski's approach [1].

Parameters
model : object, optional (default=sklearn.linear_model.LogisticRegression)
The classification model used to predict the uplift.
Methods
fit(self, X, y, t) Build the model from the training set (X, y, t).
predict(self, X, t=None) Predict an uplift for X.
fit(self, X, y, t)

Build the model from the training set (X, y, t).

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
Returns self : object
predict(self, X, t=None)

Predict an uplift for X.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
t: numpy array with shape = [n_samples,] or None
Array of treatments.
Returns
uplift: numpy array with shape = [n_samples,]
The predicted uplift values.
References
  1. A Literature Survey and Experimental Evaluation of the State-of-the-Art in Uplift Modeling: A Stepping Stone Toward the Development of Prescriptive Analytics by Floris Devriendt, Darie Moldovan, and Wouter Verbeke
from pyuplift.transformation import Jaskowski
...
model = Jaskowski()
model.fit(X[train_indexes, :], y[train_indexes], t[train_indexes])
uplift = model.predict(X[test_indexes, :])
print(uplift)

Pessimistic

The class which implements the pessimistic approach [1].

Parameters
model : object, optional (default=sklearn.linear_model.LogisticRegression)
The classification model used to predict the uplift.
Methods
fit(self, X, y, t) Build the model from the training set (X, y, t).
predict(self, X, t=None) Predict an uplift for X.
fit(self, X, y, t)

Build the model from the training set (X, y, t).

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
Returns self : object
predict(self, X, t=None)

Predict an uplift for X.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
t: numpy array with shape = [n_samples,] or None
Array of treatments.
Returns
uplift: numpy array with shape = [n_samples,]
The predicted uplift values.
References
  1. A Literature Survey and Experimental Evaluation of the State-of-the-Art in Uplift Modeling: A Stepping Stone Toward the Development of Prescriptive Analytics by Floris Devriendt, Darie Moldovan, and Wouter Verbeke
from pyuplift.transformation import Pessimistic
...
model = Pessimistic()
model.fit(X[train_indexes, :], y[train_indexes], t[train_indexes])
uplift = model.predict(X[test_indexes, :])
print(uplift)

Reflective

The class which implements the reflective approach [1].

Parameters
model : object, optional (default=sklearn.linear_model.LogisticRegression)
The classification model used to predict the uplift.
Methods
fit(self, X, y, t) Build the model from the training set (X, y, t).
predict(self, X, t=None) Predict an uplift for X.
fit(self, X, y, t)

Build the model from the training set (X, y, t).

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
Returns self : object
predict(self, X, t=None)

Predict an uplift for X.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
t: numpy array with shape = [n_samples,] or None
Array of treatments.
Returns
uplift: numpy array with shape = [n_samples,]
The predicted uplift values.
References
  1. A Literature Survey and Experimental Evaluation of the State-of-the-Art in Uplift Modeling: A Stepping Stone Toward the Development of Prescriptive Analytics by Floris Devriendt, Darie Moldovan, and Wouter Verbeke
from pyuplift.transformation import Reflective
...
model = Reflective()
model.fit(X[train_indexes, :], y[train_indexes], t[train_indexes])
uplift = model.predict(X[test_indexes, :])
print(uplift)
transformation.TransformationBaseModel() The base model for all classes which implement transformation approaches.
transformation.Lai([model, use_weights]) Lai's approach.
transformation.Kane([model, use_weights]) Kane's approach.
transformation.Jaskowski([model]) Jaskowski's approach.
transformation.Pessimistic([model]) A pessimistic approach.
transformation.Reflective([model]) A reflective approach.

Datasets

load_criteo_uplift_prediction

Loading the Criteo Uplift Prediction dataset from the local file.

Data description

This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising. It consists of 25M rows, each one representing a user with 11 features, a treatment indicator and 2 labels (visits and conversions).

Privacy

For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.

Features 11
Treatment 2
Samples total 25,309,483
Average visit rate 0.04132
Average conversion rate 0.00229

More information about the dataset can be found in the official dataset description.

Parameters
data_home: str
Specify another download and cache folder for the dataset.
By default, the dataset is stored in the data folder of the current directory.
download_if_missing: bool, default=True
Download the dataset if it is not already downloaded.
Returns:
dataset: dict
Dictionary object with the following attributes:
dataset.description : str
Description of the Criteo Uplift Prediction dataset.
dataset.data: numpy ndarray of shape (25309483, 11)
Each row corresponding to the 11 feature values in order.
dataset.feature_names: list, size 11
List of feature names.
dataset.treatment: numpy ndarray, shape (25309483,)
Each value corresponds to the treatment.
dataset.target: numpy array of shape (25309483,)
Each value corresponds to one of the outcomes. By default, it is the visit outcome (see target_visit below).
dataset.target_visit: numpy array of shape (25309483,)
Each value corresponds to whether a visit occurred for this user (binary, label).
dataset.target_exposure: numpy array of shape (25309483,)
Each value indicates whether the user was effectively exposed to the treatment (binary).
dataset.target_conversion: numpy array of shape (25309483,)
Each value corresponds to whether a conversion occurred for this user (binary, label).
Examples
from pyuplift.datasets import load_criteo_uplift_prediction
df = load_criteo_uplift_prediction()
print(df)

download_criteo_uplift_prediction

Downloading the Criteo Uplift Prediction dataset.

Data description

This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising. It consists of 25M rows, each one representing a user with 11 features, a treatment indicator and 2 labels (visits and conversions).

Privacy

For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.

Features 11
Treatment 2
Samples total 25,309,483
Average visit rate 0.04132
Average conversion rate 0.00229

More information about the dataset can be found in the official dataset description.

Parameters:
data_home: str, default=None
Specify another download and cache folder for the dataset.
url: str, default=https://s3.us-east-2.amazonaws.com/criteo-uplift-dataset/criteo-uplift.csv.gz
The URL of the data file.
Returns None
Examples
from pyuplift.datasets import download_criteo_uplift_prediction
download_criteo_uplift_prediction()

load_hillstrom_email_marketing

Loading the Hillstrom Email Marketing dataset from the local file.

Data description

This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test.

  • 1/3 were randomly chosen to receive an e-mail campaign featuring Mens merchandise.
  • 1/3 were randomly chosen to receive an e-mail campaign featuring Womens merchandise.
  • 1/3 were randomly chosen to not receive an e-mail campaign.

During a period of two weeks following the e-mail campaign, results were tracked. Your job is to tell the world if the Mens or Womens e-mail campaign was successful.

Features 8
Treatment 3
Samples total 64,000
Average spend rate 1.05091
Average visit rate 0.14678
Average conversion rate 0.00903

More information about the dataset can be found in the official paper.

Parameters:
data_home: str, default=None
Specify another download and cache folder for the dataset.
By default, the dataset is stored in the data folder of the current directory.
load_raw_data: bool, default=False
Whether to load raw data instead of preprocessed data.
download_if_missing: bool, default=True
Download the dataset if it is not already downloaded.
Returns:
dataset: dict
Dictionary object with the following attributes:
dataset.description : str
Description of the Hillstrom email marketing dataset.
dataset.data: numpy ndarray of shape (64000, 8)
Each row corresponding to the 8 feature values in order.
dataset.feature_names: list, size 8
List of feature names.
dataset.treatment: numpy ndarray, shape (64000,)
Each value corresponds to the treatment.
dataset.target: numpy array of shape (64000,)
Each value corresponds to one of the outcomes. By default, it is the spend outcome (see target_spend below).
dataset.target_spend: numpy array of shape (64000,)
Each value corresponds to how much customers spent during a two-week outcome period.
dataset.target_visit: numpy array of shape (64000,)
Each value corresponds to whether people visited the site during a two-week outcome period.
dataset.target_conversion: numpy array of shape (64000,)
Each value corresponds to whether they purchased at the site (“conversion”) during a two-week outcome period.
Examples
from pyuplift.datasets import load_hillstrom_email_marketing
df = load_hillstrom_email_marketing()
print(df)

download_hillstrom_email_marketing

Downloading the Hillstrom Email Marketing dataset.

Data description

This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test.

  • 1/3 were randomly chosen to receive an e-mail campaign featuring Mens merchandise.
  • 1/3 were randomly chosen to receive an e-mail campaign featuring Womens merchandise.
  • 1/3 were randomly chosen to not receive an e-mail campaign.

During a period of two weeks following the e-mail campaign, results were tracked. Your job is to tell the world if the Mens or Womens e-mail campaign was successful.

Features 8
Treatment 3
Samples total 64,000
Average spend rate 1.05091
Average visit rate 0.14678
Average conversion rate 0.00903

More information about the dataset can be found in the official paper.

Parameters
data_home: str
Specify another download and cache folder for the dataset.
By default, the dataset is stored in the data folder of the current directory.
url: str
The URL to file with data.
Returns None
Examples
from pyuplift.datasets import download_hillstrom_email_marketing
download_hillstrom_email_marketing()

load_lalonde_nsw

Loading the Lalonde NSW dataset from the local file.

Data description

The dataset contains the treated and control units from the male sub-sample from the National Supported Work Demonstration as used by Lalonde in his paper.

Features 7
Treatment 2
Samples total 722
Features description
  • treat - an indicator variable for treatment status.
  • age - age in years.
  • educ - years of schooling.
  • black - indicator variable for blacks.
  • hisp - indicator variable for Hispanics.
  • married - indicator variable for marital status.
  • nodegr - indicator variable for high school diploma.
  • re75 - real earnings in 1975.
  • re78 - real earnings in 1978.

More information about the dataset can be found here.

Parameters:
data_home: str, default=None
Specify another download and cache folder for the dataset.
By default, the dataset is stored in the data folder of the current directory.
download_if_missing: bool, default=True
Download the dataset if it is not already downloaded.
Returns:
dataset: dict
Dictionary object with the following attributes:
dataset.description : str
Description of the Lalonde NSW dataset.
dataset.data: numpy ndarray of shape (722, 7)
Each row corresponding to the 7 feature values in order.
dataset.feature_names: list, size 7
List of feature names.
dataset.treatment: numpy ndarray, shape (722,)
Each value corresponds to the treatment.
dataset.target: numpy array of shape (722,)
Each value corresponds to one of the outcomes. By default, it is the re78 outcome.
Examples
from pyuplift.datasets import load_lalonde_nsw
df = load_lalonde_nsw()
print(df)

download_lalonde_nsw

Downloading the Lalonde NSW dataset.

Data description

The dataset contains the treated and control units from the male sub-sample from the National Supported Work Demonstration as used by Lalonde in his paper.

Features 7
Treatment 2
Samples total 722
Features description
  • treat - an indicator variable for treatment status.
  • age - age in years.
  • educ - years of schooling.
  • black - indicator variable for blacks.
  • hisp - indicator variable for Hispanics.
  • married - indicator variable for marital status.
  • nodegr - indicator variable for high school diploma.
  • re75 - real earnings in 1975.
  • re78 - real earnings in 1978.

More information about the dataset can be found here.

Parameters
data_home: str
Specify another download and cache folder for the dataset.
By default, the dataset is stored in the data folder of the current directory.
control_data_url: str
The URL to file with data of the control group.
treated_data_url: str
The URL to file with data of the treated group.
separator: str
The separator which used in the data files.
column_names: list
List of column names of the dataset.
column_types: dict
Dictionary of types for the dataset's columns.
random_state: int
The random seed.
Returns None
Examples
from pyuplift.datasets import download_lalonde_nsw
download_lalonde_nsw()

make_linear_regression

Generate data by formula.

Data description

Synthetic data generated by the formula:

Y' = X1 + X2 * T + E
Y = Y', if Y' - int(Y') > eps,
Y = 0,  otherwise.

Statistics for the default parameters and size equal to 100,000:

Features 3
Treatment 2
Samples total size
Share of Y not equal to 0 0.49438
Range of Y values 0 to 555.93
Parameters:
size: integer
The number of observations.
x1_params : tuple(mu, sigma), default: (0, 1)
The feature with gaussian distribution and mean=mu, sd=sigma.
X1 ~ N(mu, sigma)
x2_params : tuple(mu, sigma), default: (0, 0.1)
The feature with gaussian distribution and mean=mu, sd=sigma.
X2 ~ N(mu, sigma)
x3_params : tuple(mu, sigma), default: (0, 1)
The feature with gaussian distribution and mean=mu, sd=sigma.
X3 ~ N(mu, sigma)
t_params : tuple(min, max), default: (0, 1)
The treatment with uniform distribution over the integers from min to max-1.
T ~ R(min, max)
e_params : tuple(mu, sigma), default: (0, 1)
The error with gaussian distribution and mean=mu, sd=sigma.
E ~ N(mu, sigma)
eps : float
The border (threshold) value.
random_state : integer, default=777
random_state is the seed used by the random number generator.
Returns:
dataset: pandas DataFrame
Generated data.
Examples
from pyuplift.datasets import make_linear_regression
df = make_linear_regression(10000)
print(df)
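Under the assumption that eps acts as a scalar threshold on the fractional part of Y', the documented formula can be sketched as below. The function name, parameter handling, and defaults here are illustrative, not the library's:

```python
import numpy as np

def make_linear_data(size, eps=0.5, random_state=777):
    """Sketch of the documented generator: Y' = X1 + X2*T + E,
    and Y is zeroed whenever the fractional part of Y' is <= eps.
    Parameter names and defaults are assumptions."""
    rng = np.random.default_rng(random_state)
    x1 = rng.normal(0, 1, size)      # X1 ~ N(0, 1)
    x2 = rng.normal(0, 0.1, size)    # X2 ~ N(0, 0.1)
    trt = rng.integers(0, 2, size)   # T uniform over {0, 1}
    e = rng.normal(0, 1, size)       # E ~ N(0, 1)
    y_raw = x1 + x2 * trt + e
    frac = y_raw - np.trunc(y_raw)   # Y' - int(Y') from the formula
    y = np.where(frac > eps, y_raw, 0.0)
    return x1, x2, trt, y
```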

The pyuplift.datasets module includes utilities to load datasets, including methods to download and return popular datasets. It also features some artificial data generators.

Loaders

datasets.download_criteo_uplift_prediction([data_home, url]) Downloading the Criteo Uplift Prediction dataset.
datasets.load_criteo_uplift_prediction([data_home, download_if_missing]) Loading the Criteo Uplift Prediction dataset from the local file.
datasets.download_hillstrom_email_marketing([data_home, url]) Downloading the Hillstrom Email Marketing dataset.
datasets.load_hillstrom_email_marketing([data_home, load_raw_data, download_if_missing]) Loading the Hillstrom Email Marketing dataset from the local file.
datasets.download_lalonde_nsw([data_home, control_data_url, treated_data_url, separator, column_names, column_types, random_state]) Downloading the Lalonde NSW dataset.
datasets.load_lalonde_nsw([data_home, load_raw_data, download_if_missing]) Loading the Lalonde NSW dataset from the local file.

Generators

datasets.make_linear_regression(size, [x1_params, x2_params, x3_params, t_params, e_params, eps, seed])
Generate data by the formula: Y' = X1 + X2*T + E
Y = Y', if Y' - int(Y') > eps,
Y = 0, otherwise.

Model Selection

train_test_split

Split X, y, t into random train and test subsets.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
train_share: float, optional (default=0.7)
train_share represents the proportion of the dataset to include in the train split.
random_state: int, optional (default=None)
random_state is the seed used by the random number generator.
Return
X_train: numpy ndarray
Train matrix of features.
X_test: numpy ndarray
Test matrix of features.
y_train: numpy array
Train array of target values.
y_test: numpy array
Test array of target values.
t_train: numpy array
Train array of treatments.
t_test: numpy array
Test array of treatments.
Examples
from pyuplift.model_selection import train_test_split
...
for seed in seeds:
    X_train, X_test, y_train, y_test, t_train, t_test = train_test_split(X, y, t, train_share, seed)
    model.fit(X_train, y_train, t_train)
    score = get_average_effect(y_test, t_test, model.predict(X_test))
    scores.append(score)
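The split itself can be pictured as one shared random permutation applied to X, y, and t. A sketch mirroring the documented signature (not pyuplift's implementation):

```python
import numpy as np

def split_xyt(X, y, t, train_share=0.7, random_state=None):
    """Split (X, y, t) with a single shared permutation so the
    rows of all three arrays stay aligned. Illustrative sketch."""
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(X))
    cut = int(len(X) * train_share)
    tr, te = idx[:cut], idx[cut:]
    return X[tr], X[te], y[tr], y[te], t[tr], t[te]
```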

treatment_cross_val_score

Evaluate scores by cross-validation.

Parameters
X: numpy ndarray with shape = [n_samples, n_features]
Matrix of features.
y: numpy array with shape = [n_samples,]
Array of target values.
t: numpy array with shape = [n_samples,]
Array of treatments.
train_share: float, optional (default=0.7)
train_share represents the proportion of the dataset to include in the train split.
random_state: int, optional (default=777)
random_state is the seed used by the random number generator.
Return
scores: numpy array of floats
Array of scores of the estimator for each run of the cross validation.
Examples
from pyuplift.model_selection import treatment_cross_val_score
...
for model_name in models:
    scores = treatment_cross_val_score(X, y, t, models[model_name], cv, seeds=seeds)
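In spirit, this evaluation repeats a random train/test split once per seed and scores each run. The sketch below uses hypothetical fit_predict and score callables for brevity; pyuplift's actual signature takes a model and cv:

```python
import numpy as np

def repeated_split_scores(X, y, t, fit_predict, score, seeds, train_share=0.7):
    """Seed-based repeated evaluation sketch: one random split per
    seed, score each run, return all scores. Helper-argument names
    (fit_predict, score) are illustrative."""
    scores = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        cut = int(len(X) * train_share)
        tr, te = idx[:cut], idx[cut:]
        y_pred = fit_predict(X[tr], y[tr], t[tr], X[te])
        scores.append(score(y[te], t[te], y_pred))
    return np.array(scores)
```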

The pyuplift.model_selection module includes model validation and splitter functions.

Splitter Functions

model_selection.train_test_split(X, y, t, [train_share, random_state]) Split X, y, t into random train and test subsets.

Model validation

model_selection.treatment_cross_val_score(X, y, t, model, [cv, train_share, seeds]) Evaluate scores by cross-validation.

Metrics

get_average_effect

Estimating the average effect on the test set.

Parameters:
y_test: numpy array
Actual y values.
t_test: numpy array
Actual treatment values.
y_pred: numpy array
Predicted y values by uplift model.
test_share: float
The share of the test data used for estimating the average effect.
Returns:
average effect: float
Average effect on the test set.
Examples
from pyuplift.metrics import get_average_effect
...
model.fit(X_train, y_train, t_train)
y_pred = model.predict(X_test)
effect = get_average_effect(y_test, t_test, y_pred, test_share)
print(effect)
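One common way to estimate such an effect is to rank the test set by predicted uplift, keep the top test_share fraction, and compare the mean outcomes of treated versus control within it. The sketch below is an assumption about the metric's shape; consult the library's source for the exact definition:

```python
import numpy as np

def average_effect(y_test, t_test, y_pred, test_share=0.3):
    """Hypothetical average-effect sketch: take the top test_share
    fraction by predicted uplift, then mean(y | treated) minus
    mean(y | control) inside that group."""
    order = np.argsort(-y_pred)                       # highest uplift first
    top = order[: max(1, int(len(y_pred) * test_share))]
    y_top, t_top = y_test[top], t_test[top]
    return y_top[t_top == 1].mean() - y_top[t_top == 0].mean()
```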

The pyuplift.metrics module includes score functions, performance metrics and pairwise metrics and distance computations.

metrics.get_average_effect(y_test, t_test, y_pred, [test_share]) Estimating the average effect on the test set.

Utilities

download_file

Download file from url to output_path.

Parameters
url: string
Data’s URL.
output_path: string
Path where file will be saved.
Returns None
Examples
from pyuplift.utils import download_file
...
if not os.path.exists(data_path):
    if not os.path.exists(archive_path):
        download_file(url, archive_path)
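A helper with this behavior can be sketched with the standard library; this is an illustration, not pyuplift's implementation:

```python
import os
import urllib.request

def download_file(url, output_path):
    """Sketch: stream the resource at url to output_path,
    creating parent folders first. Illustrative only."""
    folder = os.path.dirname(output_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(output_path, "wb") as out:
        out.write(response.read())
```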

retrieve_from_gz

Retrieving gz-archived data from archive_path to output_path.

Parameters
archive_path: string
The archive path.
output_path: string
The retrieved data path.
Returns None
Examples
from pyuplift.utils import retrieve_from_gz
...
if not os.path.exists(data_path):
    if not os.path.exists(archive_path):
        download_file(url, archive_path)
    retrieve_from_gz(archive_path, data_path)
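Such a helper can be sketched with the standard library's gzip and shutil modules; an illustration, not pyuplift's implementation:

```python
import gzip
import shutil

def retrieve_from_gz(archive_path, output_path):
    """Sketch: decompress a .gz archive at archive_path and write
    the raw bytes to output_path. Illustrative only."""
    with gzip.open(archive_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```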

The pyuplift.utils module includes various utilities.

utils.download_file(url, output_path) Download file from url to output_path.
utils.retrieve_from_gz(archive_path, output_path) Retrieving gz-archived data from archive_path to output_path.