CausalML is a Python package that provides a suite of uplift modeling and causal inference methods using machine learning algorithms based on recent research.
It provides a standard interface that allows users to estimate the Conditional Average Treatment Effect (CATE), also known as the Individual Treatment Effect (ITE), from experimental or observational data.
Essentially, it estimates the causal impact of intervention W on outcome Y for users with observed features X, without strong assumptions on the model form.
CausalML is committed to democratizing causal machine learning through accessible, innovative, and well-documented open-source tools that empower data scientists, researchers, and organizations. At our core, we embrace inclusivity and foster a vibrant community where members exchange ideas, share knowledge, and collaboratively shape a future where CausalML drives advancements across diverse domains.
Causal machine learning is a branch of machine learning that focuses on understanding cause-and-effect relationships in data. It goes beyond just predicting outcomes based on patterns in the data and tries to understand how changing one variable can affect an outcome.
Suppose we are trying to predict a student’s test score based on how many hours they study and how much sleep they get. Traditional machine learning models would find patterns in the data, like students who study more or sleep more tend to get higher scores.
But what if you want to know what would happen if a student studied an extra hour each day? Or slept an extra hour each night? Modeling these potential outcomes or counterfactuals is where causal machine learning comes in. It tries to understand cause-and-effect relationships - how much changing one variable (like study hours or sleep hours) will affect the outcome (the test score).
This is useful in many fields, including economics, healthcare, and policy making, where understanding the impact of interventions is crucial.
While traditional machine learning is great for prediction, causal machine learning helps us understand the difference in outcomes due to interventions.
Traditional machine learning and causal machine learning are both powerful tools, but they serve different purposes and answer different types of questions.
Traditional Machine Learning is primarily concerned with prediction. Given a set of input features, it learns a function from the data that can predict an outcome. It’s great at finding patterns and correlations in large datasets, but it doesn’t tell us about the cause-and-effect relationships between variables. It answers questions like “Given a patient’s symptoms, what disease are they likely to have?”
On the other hand, Causal Machine Learning is concerned with understanding the cause-and-effect relationships between variables. It goes beyond prediction and tries to answer questions about intervention: “What will happen if we change this variable?” For example, in a medical context, it could help answer questions like “What will happen if a patient takes this medication?”
In essence, while traditional machine learning can tell us “what is”, causal machine learning can help us understand “what if”. This makes causal machine learning particularly useful in fields where we need to make decisions based on data, such as policy making, economics, and healthcare.
Randomized Controlled Trials (RCTs) are the gold standard for causal effect measurement. Subjects are randomly exposed to a treatment, and the Average Treatment Effect (ATE) is measured as the difference between the mean outcomes of the treatment and control groups. Random assignment removes the effect of any confounders on the treatment.
If an RCT is available and the treatment effects are heterogeneous across covariates, measuring the Conditional Average Treatment Effect (CATE) can be of interest. The CATE is an estimate of the treatment effect conditioned on all available experiment covariates and confounders. We call these Heterogeneous Treatment Effects (HTEs).
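As a quick illustration, here is a minimal sketch of estimating the ATE and per-unit CATEs with CausalML, assuming a feature matrix X, a treatment assignment vector treatment, and an outcome vector y are already loaded as NumPy arrays; the learner choices below are examples only.

from xgboost import XGBRegressor
from causalml.inference.meta import LRSRegressor, BaseXRegressor

# Average Treatment Effect with a linear-regression S-Learner
ate, ate_lb, ate_ub = LRSRegressor().estimate_ate(X=X, treatment=treatment, y=y)

# Conditional Average Treatment Effects (one estimate per unit) with an X-Learner
cate = BaseXRegressor(learner=XGBRegressor()).fit_predict(X=X, treatment=treatment, y=y)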
Campaign Targeting Optimization: An important lever to increase ROI in an advertising campaign is to target the ad to the set of customers who will have a favorable response in a given KPI such as engagement or sales. CATE identifies these customers by estimating the effect of ad exposure on the KPI at the individual level from A/B experiment or historical observational data.
Personalized Engagement: A company might have multiple options to interact with its customers such as different product choices in up-sell or different messaging channels for communications. One can use CATE to estimate the heterogeneous treatment effect for each customer and treatment option combination for an optimal personalized engagement experience.
Conda environment files for Python 3.7, 3.8 and 3.9 are available in the repository. To use models under the inference.tf module (e.g. DragonNet), an additional dependency on tensorflow is required. For detailed instructions, see below.
This will create a new conda virtual environment named causalml-[tf-]py3x, where x is in [7, 8, 9], e.g. causalml-py37 or causalml-tf-py38. If you want to change the name of the environment, update the relevant YAML file in envs/.
git clone https://github.com/uber/causalml.git
cd causalml/envs/
conda env create -f environment-py38.yml   # for the virtual environment with Python 3.8 and CausalML
conda activate causalml-py38
git clone https://github.com/uber/causalml.git
cd causalml/envs/
conda env create -f environment-tf-py38.yml   # for the virtual environment with Python 3.8 and CausalML
conda activate causalml-tf-py38
pip install -U numpy   # this step is necessary to fix #338 (https://github.com/uber/causalml/issues/338)
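If a dedicated conda environment is not needed, the released package can also be installed from PyPI; tensorflow only has to be installed separately for the models under causalml.inference.tf (see the DragonNet section below). A minimal sketch:

pip install causalml
pip install tensorflow   # only needed for causalml.inference.tf models such as DragonNet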
from causalml.optimize import CounterfactualValueEstimator
from causalml.optimize import get_treatment_costs, get_actual_value

# load data set and train test split
df_train, df_test = train_test_split(df)
train_idx = df_train.index
test_idx = df_test.index

# some more code here to initiate and train the Model, and produce tm_pred
# please refer to the counterfactual_value_optimization notebook for complete example

# run the counterfactual calculation with TwoModel prediction
cve = CounterfactualValueEstimator(treatment=df_test['treatment_group_key'],
                                   control_name='control',
                                   treatment_names=conditions[1:],
                                   y_proba=y_proba,
                                   cate=tm_pred,
                                   value=conversion_value_array[test_idx],
                                   conversion_cost=cc_array[test_idx],
                                   impression_cost=ic_array[test_idx])

cve_best_idx = cve.predict_best()
cve_best = [conditions[idx] for idx in cve_best_idx]
actual_is_cve_best = df.loc[test_idx, 'treatment_group_key'] == cve_best
cve_value = actual_value.loc[test_idx][actual_is_cve_best].mean()

labels = ['Random allocation', 'Best treatment', 'T-Learner', 'CounterfactualValueEstimator']
values = [random_allocation_value, best_ate_value, tm_value, cve_value]

# plot the result
plt.bar(labels, values)
from causalml.dataset import *

# Generate synthetic data for single simulation
y, X, treatment, tau, b, e = synthetic_data(mode=1)
y, X, treatment, tau, b, e = simulate_nuisance_and_easy_treatment()

# Generate predictions for single simulation
single_sim_preds = get_synthetic_preds(simulate_nuisance_and_easy_treatment, n=1000)

# Generate multiple scatter plots to compare learner performance for a single simulation
scatter_plot_single_sim(single_sim_preds)

# Visualize distribution of learner predictions for a single simulation
distr_plot_single_sim(single_sim_preds, kind='kde')
from causalml.dataset import *

# Generalize performance summary over k simulations
num_simulations = 12
preds_summary = get_synthetic_summary(simulate_nuisance_and_easy_treatment, n=1000, k=num_simulations)

# Generate scatter plot of performance summary
scatter_plot_summary(preds_summary, k=num_simulations)

# Generate bar plot of performance summary
bar_plot_summary(preds_summary, k=num_simulations)
from causalml.metrics.sensitivity import Sensitivity
from causalml.metrics.sensitivity import SensitivitySelectionBias
from causalml.inference.meta import BaseXLearner
from sklearn.linear_model import LinearRegression

# Calling the Base XLearner class and return the sensitivity analysis summary report
learner_x = BaseXLearner(LinearRegression())
sens_x = Sensitivity(df=df, inference_features=INFERENCE_FEATURES, p_col='pihat',
                     treatment_col=TREATMENT_COL, outcome_col=OUTCOME_COL, learner=learner_x)

# Here for Selection Bias method will use default one-sided confounding function and alpha (quantile range of outcome values) input
sens_sumary_x = sens_x.sensitivity_analysis(methods=['Placebo Treatment',
                                                     'Random Cause',
                                                     'Subset Data',
                                                     'Random Replace',
                                                     'Selection Bias'], sample_size=0.5)

# Selection Bias: Alignment confounding Function
sens_x_bias_alignment = SensitivitySelectionBias(df, INFERENCE_FEATURES, p_col='pihat',
                                                 treatment_col=TREATMENT_COL, outcome_col=OUTCOME_COL,
                                                 learner=learner_x, confound='alignment',
                                                 alpha_range=None)

# Plot the results by rsquare with partial r-square results by each individual features
sens_x_bias_alignment.plot(lls_x_bias_alignment, partial_rsqs_x_bias_alignment,
                           type='r.squared', partial_rsqs=True)
For more details, please refer to the feature_selection.ipynb notebook and the associated paper reference by Zhao, Zhenyu, et al.
from causalml.feature_selection.filters import FilterSelect
from causalml.dataset import make_uplift_classification

# define parameters for simulation
y_name = 'conversion'
treatment_group_keys = ['control', 'treatment1']
n = 100000
n_classification_features = 50
n_classification_informative = 10
n_classification_repeated = 0
n_uplift_increase_dict = {'treatment1': 8}
n_uplift_decrease_dict = {'treatment1': 4}
delta_uplift_increase_dict = {'treatment1': 0.1}
delta_uplift_decrease_dict = {'treatment1': -0.1}

# make a synthetic uplift data set
random_seed = 20200808
df, X_names = make_uplift_classification(
    treatment_name=treatment_group_keys,
    y_name=y_name,
    n_samples=n,
    n_classification_features=n_classification_features,
    n_classification_informative=n_classification_informative,
    n_classification_repeated=n_classification_repeated,
    n_uplift_increase_dict=n_uplift_increase_dict,
    n_uplift_decrease_dict=n_uplift_decrease_dict,
    delta_uplift_increase_dict=delta_uplift_increase_dict,
    delta_uplift_decrease_dict=delta_uplift_decrease_dict,
    random_seed=random_seed
)

# Feature selection with Filter method
filter_f = FilterSelect()
method = 'F'
f_imp = filter_f.get_importance(df, X_names, y_name, method, treatment_group='treatment1')
print(f_imp)

# Use likelihood ratio test method
method = 'LR'
lr_imp = filter_f.get_importance(df, X_names, y_name, method, treatment_group='treatment1')
print(lr_imp)

# Use KL divergence method
method = 'KL'
kl_imp = filter_f.get_importance(df, X_names, y_name, method, treatment_group='treatment1', n_bins=10)
print(kl_imp)
In this notebook, we will generate some synthetic data to demonstrate how to use the various Meta-Learner algorithms in order to estimate Individual Treatment Effects and Average Treatment Effects with confidence intervals.
[1]:
%load_ext autoreload
%autoreload 2
[2]:
from causalml.inference.meta import LRSRegressor
[3]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from xgboost import XGBRegressor
import warnings
from causalml.inference.meta import LRSRegressor
from causalml.inference.meta import XGBTRegressor, MLPTRegressor
from causalml.inference.meta import BaseXRegressor, BaseRRegressor, BaseSRegressor, BaseTRegressor
from causalml.match import NearestNeighborMatch, MatchOptimizer, create_table_one
from causalml.propensity import ElasticNetPropensityModel
from causalml.dataset import *
from causalml.metrics import *
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')
%matplotlib inline
We have implemented 4 modes of generating synthetic data (specified by input parameter mode). Refer to the References section for more detail on these data generation processes.
[5]:
# Generate synthetic data using mode 1
y, X, treatment, tau, b, e = synthetic_data(mode=1, n=10000, p=8, sigma=1.0)
A meta-learner can be instantiated by calling a base learner class and providing an sklearn/xgboost regressor class as input. Alternatively, we have provided some ready-to-use learners that have already inherited their respective base learner class capabilities. These are more abstracted and can be used quickly out of the box.
[6]:
# Ready-to-use S-Learner using LinearRegression
learner_s = LRSRegressor()
ate_s = learner_s.estimate_ate(X=X, treatment=treatment, y=y)
print(ate_s)
print('ATE estimate: {:.03f}'.format(ate_s[0][0]))
print('ATE lower bound: {:.03f}'.format(ate_s[1][0]))
print('ATE upper bound: {:.03f}'.format(ate_s[2][0]))
# After calling estimate_ate, add pretrain=True flag to skip training
# This flag is applicable to the other meta-learners as well
ate_s = learner_s.estimate_ate(X=X, treatment=treatment, y=y, pretrain=True)
print(ate_s)
print('ATE estimate: {:.03f}'.format(ate_s[0][0]))
print('ATE lower bound: {:.03f}'.format(ate_s[1][0]))
print('ATE upper bound: {:.03f}'.format(ate_s[2][0]))
(array([0.72721128]), array([0.67972656]), array([0.77469599]))
ATE estimate: 0.727
ATE lower bound: 0.680
ATE upper bound: 0.775
(array([0.72721128]), array([0.67972656]), array([0.77469599]))
ATE estimate: 0.727
ATE lower bound: 0.680
ATE upper bound: 0.775
[7]:
# Ready-to-use T-Learner using XGB
learner_t = XGBTRegressor()
ate_t = learner_t.estimate_ate(X=X, treatment=treatment, y=y)
print('Using the ready-to-use XGBTRegressor class')
print(ate_t)
# Calling the Base Learner class and feeding in XGB
learner_t = BaseTRegressor(learner=XGBRegressor())
ate_t = learner_t.estimate_ate(X=X, treatment=treatment, y=y)
print('\nUsing the BaseTRegressor class and using XGB (same result):')
print(ate_t)
# Calling the Base Learner class and feeding in LinearRegression
learner_t = BaseTRegressor(learner=LinearRegression())
ate_t = learner_t.estimate_ate(X=X, treatment=treatment, y=y)
print('\nUsing the BaseTRegressor class and using Linear Regression (different result):')
print(ate_t)
Using the ready-to-use XGBTRegressor class
(array([0.55539207]), array([0.53185148]), array([0.57893267]))
Using the BaseTRegressor class and using XGB (same result):
(array([0.55539207]), array([0.53185148]), array([0.57893267]))
Using the BaseTRegressor class and using Linear Regression (different result):
(array([0.71740976]), array([0.67655445]), array([0.75826507]))
[8]:
# X Learner with propensity score input
# Calling the Base Learner class and feeding in XGB
learner_x = BaseXRegressor(learner=XGBRegressor())
ate_x = learner_x.estimate_ate(X=X, treatment=treatment, y=y, p=e)
print('Using the BaseXRegressor class and using XGB:')
print(ate_x)
# Calling the Base Learner class and feeding in LinearRegression
learner_x = BaseXRegressor(learner=LinearRegression())
ate_x = learner_x.estimate_ate(X=X, treatment=treatment, y=y, p=e)
print('\nUsing the BaseXRegressor class and using Linear Regression:')
print(ate_x)
Using the BaseXRegressor class and using XGB:
(array([0.52239345]), array([0.50279387]), array([0.54199302]))
Using the BaseXRegressor class and using Linear Regression:
(array([0.71740976]), array([0.67655445]), array([0.75826507]))
[9]:
# X Learner without propensity score input
# Calling the Base Learner class and feeding in XGB
learner_x = BaseXRegressor(XGBRegressor())
ate_x = learner_x.estimate_ate(X=X, treatment=treatment, y=y)
print('Using the BaseXRegressor class and using XGB without propensity score input:')
print(ate_x)
# Calling the Base Learner class and feeding in LinearRegression
learner_x = BaseXRegressor(learner=LinearRegression())
ate_x = learner_x.estimate_ate(X=X, treatment=treatment, y=y)
print('\nUsing the BaseXRegressor class and using Linear Regression without propensity score input:')
print(ate_x)
Using the BaseXRegressor class and using XGB without propensity score input:
(array([0.52348025]), array([0.50385245]), array([0.54310804]))
Using the BaseXRegressor class and using Linear Regression without propensity score input:
(array([0.71740976]), array([0.67655445]), array([0.75826507]))
[10]:
# R Learner with propensity score input
# Calling the Base Learner class and feeding in XGB
learner_r = BaseRRegressor(learner=XGBRegressor())
ate_r = learner_r.estimate_ate(X=X, treatment=treatment, y=y, p=e)
print('Using the BaseRRegressor class and using XGB:')
print(ate_r)
# Calling the Base Learner class and feeding in LinearRegression
learner_r = BaseRRegressor(learner=LinearRegression())
ate_r = learner_r.estimate_ate(X=X, treatment=treatment, y=y, p=e)
print('Using the BaseRRegressor class and using Linear Regression:')
print(ate_r)
Using the BaseRRegressor class and using XGB:
(array([0.51551318]), array([0.5150305]), array([0.51599587]))
Using the BaseRRegressor class and using Linear Regression:
(array([0.51503495]), array([0.51461987]), array([0.51545004]))
[11]:
# R Learner with propensity score input and random sample weight
# Calling the Base Learner class and feeding in XGB
learner_r = BaseRRegressor(learner=XGBRegressor())
sample_weight = np.random.randint(1, 3, len(y))
ate_r = learner_r.estimate_ate(X=X, treatment=treatment, y=y, p=e, sample_weight=sample_weight)
print('Using the BaseRRegressor class and using XGB:')
print(ate_r)
Using the BaseRRegressor class and using XGB:
(array([0.48910448]), array([0.48861819]), array([0.48959077]))
[12]:
# R Learner without propensity score input
# Calling the Base Learner class and feeding in XGB
learner_r = BaseRRegressor(learner=XGBRegressor())
ate_r = learner_r.estimate_ate(X=X, treatment=treatment, y=y)
print('Using the BaseRRegressor class and using XGB without propensity score input:')
print(ate_r)
# Calling the Base Learner class and feeding in LinearRegression
learner_r = BaseRRegressor(learner=LinearRegression())
ate_r = learner_r.estimate_ate(X=X, treatment=treatment, y=y)
print('Using the BaseRRegressor class and using Linear Regression without propensity score input:')
print(ate_r)
Using the BaseRRegressor class and using XGB without propensity score input:
(array([0.45400543]), array([0.45352042]), array([0.45449043]))
Using the BaseRRegressor class and using Linear Regression without propensity score input:
(array([0.59802659]), array([0.59761147]), array([0.5984417]))
# Single simulation
train_preds, valid_preds = get_synthetic_preds_holdout(simulate_nuisance_and_easy_treatment,
n=50000,
valid_size=0.2)
[21]:
# distribution plot for single simulation of Training
distr_plot_single_sim(train_preds, kind='kde', linewidth=2, bw_method=0.5,
drop_learners=['S Learner (LR)',' S Learner (XGB)'])
[22]:
# distribution plot for single simulation of Validation
distr_plot_single_sim(valid_preds, kind='kde', linewidth=2, bw_method=0.5,
drop_learners=['S Learner (LR)', 'S Learner (XGB)'])
[23]:
# Scatter Plots for a Single Simulation of Training Data
scatter_plot_single_sim(train_preds)
[24]:
# Scatter Plots for a Single Simulation of Validation Data
scatter_plot_single_sim(valid_preds)
[25]:
# Cumulative Gain AUUC values for a Single Simulation of Training Data
get_synthetic_auuc(train_preds, drop_learners=['S Learner (LR)'])
[25]:
           Learner  cum_gain_auuc
0          Actuals   4.934321e+06
2   T Learner (LR)   4.932595e+06
4   X Learner (LR)   4.932595e+06
6   R Learner (LR)   4.931463e+06
1  S Learner (XGB)   4.707889e+06
5  X Learner (XGB)   4.507384e+06
3  T Learner (XGB)   4.389641e+06
7  R Learner (XGB)   4.309501e+06
8           Random   4.002357e+06
[26]:
# Cumulative Gain AUUC values for a Single Simulation of Validation Data
get_synthetic_auuc(valid_preds, drop_learners=['S Learner (LR)'])
In this notebook, we use synthetic data to demonstrate the use of the tree-based algorithms.
[3]:
import numpy as np
import pandas as pd
from causalml.dataset import make_uplift_classification
from causalml.inference.tree import UpliftRandomForestClassifier
from causalml.metrics import plot_gain
from sklearn.model_selection import train_test_split
The CausalML package contains various functions to generate synthetic datasets for uplift modeling. Here we generate a classification dataset using the make_uplift_classification() function.
[3]:
df, x_names = make_uplift_classification()
[4]:
df.head()
[4]:
  treatment_group_key  x1_informative  x2_informative  x3_informative  x4_informative
0             control       -0.542888        1.976361       -0.531359       -2.354211
1          treatment3        0.258654        0.552412        1.434239       -1.422311
2          treatment1        1.697012       -2.762600       -0.662874       -1.682340
3          treatment2       -1.441644        1.823648        0.789423       -0.295398
4             control       -0.625074        3.002388       -0.096288        1.938235

   x5_informative  x6_irrelevant  x7_irrelevant  x8_irrelevant  x9_irrelevant  ...
0       -0.380629      -2.614321      -0.128893       0.448689      -2.275192  ...
1        0.089131       0.790293       1.159513       1.578868       0.166540  ...
2        1.217443       0.837982       1.042981       0.177398      -0.112409  ...
3        0.718509      -0.492993       0.947824      -1.307887       0.123340  ...
4        3.392424      -0.465860      -0.919897      -1.072592      -1.331181  ...

   x12_uplift_increase  x13_increase_mix  x14_uplift_increase  x15_uplift_increase
0            -1.315304          0.742654             1.891699            -2.428395
1            -1.391878         -0.623243             2.443972            -2.889253
2            -1.132497          1.050179             1.573054            -1.788427
3            -2.084619          0.058481             1.369439             0.422538
4            -1.403984          0.760430             1.917635            -2.347675

   x16_increase_mix  x17_uplift_increase  x18_uplift_increase  x19_increase_mix  conversion  treatment_effect
0          1.541875            -0.817705            -0.610194         -0.591581           0                 0
1          2.018585            -1.109296            -0.380362         -1.667606           0                 0
2          1.341609            -0.749227            -2.091521         -0.471386           0                 0
3          1.087176            -0.966666            -1.785592         -1.268379           1                 1
4          1.560946            -0.833067            -1.407884         -0.781343           0                 0

5 rows × 22 columns
[5]:
# Look at the conversion rate and sample size in each group
df.pivot_table(values='conversion',
index='treatment_group_key',
aggfunc=[np.mean, np.size],
margins=True)
In this section, we first fit the uplift random forest classifier using training data. We then use the fitted model to make a prediction using testing data. The prediction returns an ndarray in which each column contains the predicted uplift if the unit was in the corresponding treatment group.
[6]:
# Split data to training and testing samples for model validation (next section)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=111)
[7]:
from causalml.inference.tree import UpliftTreeClassifier
[8]:
clf = UpliftTreeClassifier(control_name='control')
clf.fit(df_train[x_names].values,
treatment=df_train['treatment_group_key'].values,
y=df_train['conversion'].values)
p = clf.predict(df_test[x_names].values)
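The remainder of this section works with the forest version, UpliftRandomForestClassifier, which follows the same fit/predict interface. A brief sketch on the same training split that produces the uplift_model and y_pred objects used next (the hyperparameter values are illustrative only):

# Train an uplift random forest on the same training data
uplift_model = UpliftRandomForestClassifier(n_estimators=200,
                                            max_depth=5,
                                            min_samples_leaf=200,
                                            min_samples_treatment=50,
                                            n_reg=100,
                                            evaluationFunction='KL',
                                            control_name='control')
uplift_model.fit(df_train[x_names].values,
                 treatment=df_train['treatment_group_key'].values,
                 y=df_train['conversion'].values)
# One column of predicted uplift per treatment group (the control group is excluded)
y_pred = uplift_model.predict(df_test[x_names].values)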
# Put the predictions to a DataFrame for a neater presentation
# The output of `predict()` is a numpy array with the shape of [n_sample, n_treatment] excluding the
# predictions for the control group.
result = pd.DataFrame(y_pred,
columns=uplift_model.classes_[1:])
result.head()
The uplift curve is calculated on a synthetic population that consists of those that were in the control group and those who happened to be in the treatment group recommended by the model. We use the synthetic population to calculate the actual treatment effect within predicted treatment effect quantiles. Because the data is randomized, we have a roughly equal number of treatment and control observations in the predicted quantiles and there is no self selection to treatment groups.
[16]:
# If all deltas are negative, assign to control; otherwise assign to the treatment
# with the highest delta
best_treatment = np.where((result < 0).all(axis=1),
'control',
result.idxmax(axis=1))
# Create indicator variables for whether a unit happened to have the
# recommended treatment or was in the control group
actual_is_best = np.where(df_test['treatment_group_key'] == best_treatment, 1, 0)
actual_is_control = np.where(df_test['treatment_group_key'] == 'control', 1, 0)
Calculate the observed treatment effect per predicted treatment effect quantile
We use the observed treatment effect to calculate the uplift curve, which answers the question: how much of the total cumulative uplift could we have captured by targeting a subset of the population sorted according to the predicted uplift, from highest to lowest?
CausalML has the plot_gain() function which calculates the uplift curve given a DataFrame containing the treatment assignment, observed outcome and the predicted treatment effect.
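A sketch of that last step using the objects defined above; the column names 'is_treated' and 'uplift_model' are illustrative, as plot_gain() only expects a treatment flag, the observed outcome, and one prediction column per model:

# Restrict to the synthetic population: control units plus units that happened
# to receive the model-recommended treatment
synthetic = ((df_test['treatment_group_key'] == 'control') | (actual_is_best == 1)).values

# Assemble the frame for plot_gain(): treatment flag, observed outcome, and the
# predicted uplift of the recommended treatment for each unit
auuc_metrics = pd.DataFrame({
    'is_treated': 1 - actual_is_control[synthetic],
    'conversion': df_test['conversion'].values[synthetic],
    'uplift_model': result.values[synthetic].max(axis=1),
})

plot_gain(auuc_metrics, outcome_col='conversion', treatment_col='is_treated')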
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from xgboost import XGBRegressor, XGBClassifier
import warnings
# from causalml.inference.meta import XGBTLearner, MLPTLearner
from causalml.inference.meta import BaseSRegressor, BaseTRegressor, BaseXRegressor, BaseRRegressor
from causalml.inference.meta import BaseSClassifier, BaseTClassifier, BaseXClassifier, BaseRClassifier
from causalml.inference.meta import LRSRegressor
from causalml.match import NearestNeighborMatch, MatchOptimizer, create_table_one
from causalml.propensity import ElasticNetPropensityModel
from causalml.dataset import *
from causalml.metrics import *
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.4f' % x)
# imports from package
import logging
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae
import statsmodels.api as sm
from copy import deepcopy
logger = logging.getLogger('causalml')
logging.basicConfig(level=logging.INFO)
%matplotlib inline
# Generate synthetic data using mode 1
y, X, treatment, tau, b, e = synthetic_data(mode=1, n=10000, p=8, sigma=1.0)
treatment = np.array(['treatment_a' if val==1 else 'control' for val in treatment])
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:Bootstrap Confidence Intervals for ATE
100%|██████████| 100/100 [01:56<00:00, 1.17s/it]
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:Bootstrap Confidence Intervals for ATE
100%|██████████| 100/100 [02:42<00:00, 1.63s/it]
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
Note: we randomize the assignment of treatment flag AFTER the synthetic data generation process, so it doesn’t make sense to measure accuracy metrics here. Next steps would be to include multi-treatment in the DGP itself.
[69]:
# Generate synthetic data using mode 1
y, X, treatment, tau, b, e = synthetic_data(mode=1, n=10000, p=8, sigma=1.0)
treatment = np.array([('treatment_a' if np.random.random() > 0.2 else 'treatment_b')
if val==1 else 'control' for val in treatment])
e = {group: e for group in np.unique(treatment)}
[70]:
pd.Series(treatment).value_counts()
[70]:
control 4768
treatment_a 4146
treatment_b 1086
dtype: int64
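The log output below comes from calls of the following kind on the multi-treatment data. This is a sketch: the learner choice and the bootstrap arguments are illustrative and assume the meta-learner API shown earlier; the propensity dict e defined above is passed per treatment group.

learner_r = BaseRRegressor(learner=XGBRegressor())
ate_r, ate_r_lb, ate_r_ub = learner_r.estimate_ate(X=X, treatment=treatment, y=y, p=e,
                                                   bootstrap_ci=True, n_bootstraps=100)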
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:training the treatment effect model for treatment_b with R-loss
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:training the treatment effect model for treatment_b with R-loss
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:training the treatment effect model for treatment_b with R-loss
INFO:causalml:Bootstrap Confidence Intervals for ATE
100%|██████████| 100/100 [02:19<00:00, 1.39s/it]
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:training the treatment effect model for treatment_b with R-loss
INFO:causalml:Bootstrap Confidence Intervals for ATE
100%|██████████| 100/100 [02:19<00:00, 1.39s/it]
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:training the treatment effect model for treatment_b with R-loss
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:training the treatment effect model for treatment_b with R-loss
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:training the treatment effect model for treatment_b with R-loss
INFO:causalml:Bootstrap Confidence Intervals
100%|██████████| 100/100 [00:37<00:00, 2.65it/s]
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment_a with R-loss
INFO:causalml:training the treatment effect model for treatment_b with R-loss
INFO:causalml:Bootstrap Confidence Intervals
2%|▏ | 2/100 [00:00<00:36, 2.69it/s]
This notebook will show how to use visualization for:
Uplift Tree and Uplift Random Forest
Visualize a trained uplift classification tree model
Visualize an uplift tree in a trained uplift random forests
Training and Validation Data
Visualize the validation tree: fill the trained uplift classification tree with validation (or testing) data, and show the statistics for both training data and validation data
One Treatment Group and Multiple Treatment Groups
Visualize the case where there are one control group and one treatment group
Visualize the case where there are one control group and multiple treatment groups
from causalml.dataset import make_uplift_classification
from causalml.inference.tree import UpliftTreeClassifier, UpliftRandomForestClassifier
from causalml.inference.tree import uplift_tree_string, uplift_tree_plot
[2]:
import numpy as np
import pandas as pd
from IPython.display import Image
from sklearn.model_selection import train_test_split
One Control + One Treatment for Uplift Classification Tree
[3]:
# Data generation
df, x_names = make_uplift_classification()
# Rename features for easy interpretation of visualization
x_names_new = ['feature_%s'%(i) for i in range(len(x_names))]
rename_dict = {x_names[i]:x_names_new[i] for i in range(len(x_names))}
df = df.rename(columns=rename_dict)
x_names = x_names_new
df.head()
df = df[df['treatment_group_key'].isin(['control','treatment1'])]
# Look at the conversion rate and sample size in each group
df.pivot_table(values='conversion',
index='treatment_group_key',
aggfunc=[np.mean, np.size],
margins=True)
[3]:
                          mean       size
                    conversion conversion
treatment_group_key
control                 0.5110       1000
treatment1              0.5140       1000
All                     0.5125       2000
[4]:
# Split data to training and testing samples for model validation (next section)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=111)
# Train uplift tree
uplift_model = UpliftTreeClassifier(max_depth = 4, min_samples_leaf = 200, min_samples_treatment = 50, n_reg = 100, evaluationFunction='KL', control_name='control')
uplift_model.fit(df_train[x_names].values,
treatment=df_train['treatment_group_key'].values,
y=df_train['conversion'].values)
[5]:
# Print uplift tree as a string
result = uplift_tree_string(uplift_model.fitted_uplift_tree, x_names)
feature_17 >= -0.44234212654232735?
yes -> feature_10 >= 1.020659213325515?
        yes -> [0.3813559322033898, 0.6065573770491803]
        no  -> [0.5078125, 0.5267857142857143]
no  -> feature_9 >= 0.8142773340486678?
        yes -> [0.4596774193548387, 0.61]
        no  -> feature_4 >= 0.280545459525536?
                yes -> [0.5522875816993464, 0.4143302180685358]
                no  -> [0.5070422535211268, 0.5748031496062992]
uplift score: the treatment effect between treatment and control (when there are multiple treatment groups, this is the maximum of the treatment effects)
uplift p_value: the p_value for the treatment effect
validation uplift score: when validation data is filled into the tree, this reflects the uplift score based on the validation data. It can be compared with the uplift score (for training data) to check for over-fitting.
[6]:
# Plot uplift tree
graph = uplift_tree_plot(uplift_model.fitted_uplift_tree,x_names)
Image(graph.create_png())
[6]:
Visualize Validation Tree: One Control + One Treatment for Uplift Classification Tree
Note the validation uplift score will update.
[7]:
### Fill the trained tree with testing data set
# The uplift score based on testing dataset is shown as validation uplift score in the tree nodes
uplift_model.fill(X=df_test[x_names].values, treatment=df_test['treatment_group_key'].values, y=df_test['conversion'].values)
# Plot uplift tree
graph = uplift_tree_plot(uplift_model.fitted_uplift_tree,x_names)
Image(graph.create_png())
# Split data to training and testing samples for model validation (next section)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=111)
# Train uplift tree
uplift_model = UpliftRandomForestClassifier(n_estimators=5, max_depth = 5, min_samples_leaf = 200, min_samples_treatment = 50, n_reg = 100, evaluationFunction='KL', control_name='control')
uplift_model.fit(df_train[x_names].values,
treatment=df_train['treatment_group_key'].values,
y=df_train['conversion'].values)
[9]:
# Specify a tree in the random forest (the index can be any integer from 0 to n_estimators-1)
uplift_tree = uplift_model.uplift_forest[0]
# Print uplift tree as a string
result = uplift_tree_string(uplift_tree.fitted_uplift_tree, x_names)
feature_0 >= -0.44907381030867755?
yes -> feature_6 >= -0.0583060585067711?
        yes -> feature_9 >= 0.03401322870693866?
                yes -> [0.4774193548387097, 0.5396825396825397]
                no  -> [0.34615384615384615, 0.6129032258064516]
        no  -> feature_12 >= 0.4863045964698285?
                yes -> [0.48299319727891155, 0.5714285714285714]
                no  -> [0.582089552238806, 0.4452054794520548]
no  -> feature_10 >= 1.0043523431178796?
        yes -> [0.4807692307692308, 0.35766423357664234]
        no  -> [0.5229357798165137, 0.5426356589147286]
[10]:
# Plot uplift tree
graph = uplift_tree_plot(uplift_tree.fitted_uplift_tree,x_names)
Image(graph.create_png())
### Fill the trained tree with testing data set
# The uplift score based on testing dataset is shown as validation uplift score in the tree nodes
uplift_tree.fill(X=df_test[x_names].values, treatment=df_test['treatment_group_key'].values, y=df_test['conversion'].values)
# Plot uplift tree
graph = uplift_tree_plot(uplift_tree.fitted_uplift_tree,x_names)
Image(graph.create_png())
# Data generation
df, x_names = make_uplift_classification()
# Look at the conversion rate and sample size in each group
df.pivot_table(values='conversion',
index='treatment_group_key',
aggfunc=[np.mean, np.size],
margins=True)
[12]:
                          mean       size
                    conversion conversion
treatment_group_key
control                  0.511       1000
treatment1               0.514       1000
treatment2               0.559       1000
treatment3               0.600       1000
All                      0.546       4000
[13]:
# Split data to training and testing samples for model validation (next section)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=111)
# Train uplift tree
uplift_model = UpliftTreeClassifier(max_depth = 3, min_samples_leaf = 200, min_samples_treatment = 50, n_reg = 100, evaluationFunction='KL', control_name='control')
uplift_model.fit(df_train[x_names].values,
treatment=df_train['treatment_group_key'].values,
y=df_train['conversion'].values)
[14]:
# Plot uplift tree
# The uplift score represents the best uplift score among all treatment effects
graph = uplift_tree_plot(uplift_model.fitted_uplift_tree,x_names)
Image(graph.create_png())
[14]:
[15]:
# Save the graph as pdf
graph.write_pdf("tbc.pdf")
# Save the graph as png
graph.write_png("tbc.png")
[15]:
True
Model Interpretation with Feature Importance and SHAP Values
[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from causalml.inference.meta import BaseSRegressor, BaseTRegressor, BaseXRegressor, BaseRRegressor
from causalml.inference.tree import UpliftTreeClassifier, UpliftRandomForestClassifier
from causalml.dataset.regression import synthetic_data
from sklearn.linear_model import LinearRegression
import shap
import matplotlib.pyplot as plt
import time
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import os
import warnings
warnings.filterwarnings('ignore')
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True' # for lightgbm to work
%reload_ext autoreload
%autoreload 2
%matplotlib inline
[2]:
plt.style.use('fivethirtyeight')
[4]:
n_features = 25
n_samples = 10000
y, X, w, tau, b, e = synthetic_data(mode=1, n=n_samples, p=n_features, sigma=0.5)
[5]:
w_multi = np.array(['treatment_A' if x==1 else 'control' for x in w])
e_multi = {'treatment_A': e}
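The plotting cells below assume an S-Learner has already been fitted and its SHAP values computed. A brief sketch of those omitted steps follows; the actual notebook uses descriptive random-word feature names (such as 'tiger', referenced further down), so the feature_names list here is just a placeholder:

feature_names = ['feature_%d' % i for i in range(n_features)]

# Fit an S-Learner on the multi-treatment data and get per-unit CATE estimates
slearner = BaseSRegressor(LGBMRegressor(), control_name='control')
slearner.estimate_ate(X, w_multi, y)
slearner_tau = slearner.fit_predict(X, w_multi, y)

# SHAP values of the fitted model, keyed by treatment group
shap_slearner = slearner.get_shap_values(X=X, tau=slearner_tau)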
# Plot shap values without specifying shap_dict
slearner.plot_shap_values(X=X, tau=slearner_tau, features=feature_names)
[22]:
# Plot shap values WITH specifying shap_dict
slearner.plot_shap_values(X=X, shap_dict=shap_slearner)
[23]:
# interaction_idx set to None (no color coding for interaction effects)
slearner.plot_shap_dependence(treatment_group='treatment_A',
feature_idx=1,
X=X,
tau=slearner_tau,
interaction_idx=None,
shap_dict=shap_slearner)
[24]:
# interaction_idx set to 'auto' (searches for feature with greatest approximate interaction)
# specify feature names
slearner.plot_shap_dependence(treatment_group='treatment_A',
feature_idx='tiger',
X=X,
tau=slearner_tau,
interaction_idx='auto',
shap_dict=shap_slearner,
features=feature_names)
[25]:
# interaction_idx set to specific index
slearner.plot_shap_dependence(treatment_group='treatment_A',
feature_idx=1,
X=X,
tau=slearner_tau,
interaction_idx=10,
shap_dict=shap_slearner,
features=feature_names)
Note that uplift trees/forests are only implemented for classification at the moment, hence the following section uses a different synthetic data generation process.
This notebook demonstrates the issue of using uplift curves without knowing true treatment effect and how to solve it by using TMLE as a proxy of the true treatment effect.
If true treatment effect is known as in simulations, the uplift curve of a model uses the cumulative sum of the treatment effect sorted by model’s CATE estimate.
In the figure below, the uplift curve of X-learner shows positive lift close to the optimal lift by the ground truth.
If true treatment effect is unknown as in practice, the uplift curve of a model uses the cumulative mean difference of outcome in the treatment and control group sorted by model’s CATE estimate.
In the figure below, the uplift curves of X-learner as well as the ground truth show no lift incorrectly.
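A sketch of both variants with plot_gain(), assuming arrays y, w, the true effect tau, and an X-Learner CATE estimate cate_x from the omitted cells; the column names are illustrative:

import pandas as pd
from causalml.metrics import plot_gain

df_eval = pd.DataFrame({'y': y, 'w': w, 'X Learner': cate_x.flatten(), 'tau': tau})

# True treatment effect known (simulation): the curve is the cumulative sum of tau
# sorted by the model's CATE estimate
plot_gain(df_eval, outcome_col='y', treatment_col='w', treatment_effect_col='tau')

# True treatment effect unknown (practice): without a tau column, the curve falls back
# to the cumulative mean outcome difference between treatment and control
plot_gain(df_eval.drop(columns='tau'), outcome_col='y', treatment_col='w')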
DragonNet vs Meta-Learners Benchmark with IHDP + Synthetic Datasets
DragonNet requires tensorflow. If you haven't installed it yet, please install tensorflow as follows:
pip install tensorflow
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error as mse
from scipy.stats import entropy
import warnings
from causalml.inference.meta import LRSRegressor
from causalml.inference.meta import XGBTRegressor, MLPTRegressor
from causalml.inference.meta import BaseXRegressor, BaseRRegressor, BaseSRegressor, BaseTRegressor
from causalml.inference.tf import DragonNet
from causalml.match import NearestNeighborMatch, MatchOptimizer, create_table_one
from causalml.propensity import ElasticNetPropensityModel
from causalml.dataset.regression import *
from causalml.metrics import *
import os, sys
%matplotlib inline
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')
sns.set_palette('Paired')
plt.rcParams['figure.figsize'] = (12,8)
Hill introduced a semi-synthetic dataset constructed from the Infant Health and Development Program (IHDP). This dataset is based on a randomized experiment investigating the effect of home visits by specialists on future cognitive scores. The data has 747 observations (rows). The IHDP simulation is considered the de-facto standard benchmark for neural network treatment effect estimation methods.
The original paper uses 1000 realizations from the NPCI package, but for illustration purposes, we use one dataset (realization) as an example below.
[3]:
df = pd.read_csv(f'data/ihdp_npci_3.csv', header=None)
cols = ["treatment", "y_factual", "y_cfactual", "mu0", "mu1"] + [f'x{i}' for i in range(1,26)]
df.columns = cols
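As a brief sketch of how the benchmark uses this frame (not the full comparison), one can estimate CATEs with one of the meta-learners imported above and score them against the known counterfactual means mu0 and mu1; the learner choice here is illustrative:

X = df[[f'x{i}' for i in range(1, 26)]].values
treatment = df['treatment'].values
y = df['y_factual'].values
tau_true = (df['mu1'] - df['mu0']).values   # true per-unit effect, available because the data are semi-synthetic

learner_x = BaseXRegressor(learner=XGBRegressor())
cate_x = learner_x.fit_predict(X=X, treatment=treatment, y=y)
print('MAE of CATE vs. true effect:', mean_absolute_error(tau_true, cate_x.flatten()))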
The data generation mechanism is described in Syrgkanis et al “Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments” (2019).
We provide five methods for sensitivity analysis: Placebo Treatment, Random Cause, Subset Data, Random Replace, and Selection Bias. This notebook will walk through how to use the combined function sensitivity_analysis() to compare the different methods and also how to use each individual method separately:
Placebo Treatment: Replacing treatment with a random variable
Irrelevant Additional Confounder: Adding a random common cause variable
Subset validation: Removing a random subset of the data
Selection Bias method with One Sided confounding function and Alignment confounding function
Random Replace: Randomly replace a covariate with an irrelevant variable
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import warnings
import matplotlib
from causalml.inference.meta import BaseXLearner
from causalml.dataset import synthetic_data
from causalml.metrics.sensitivity import Sensitivity
from causalml.metrics.sensitivity import SensitivityRandomReplace, SensitivitySelectionBias
plt.style.use('fivethirtyeight')
matplotlib.rcParams['figure.figsize'] = [8, 8]
warnings.filterwarnings('ignore')
# logging.basicConfig(level=logging.INFO)
pd.options.display.float_format = '{:.4f}'.format
Sensitivity Analysis Summary Report (with One-sided confounding function and default alpha)
[9]:
# Calling the Base XLearner class and return the sensitivity analysis summary report
learner_x = BaseXLearner(LinearRegression())
sens_x = Sensitivity(df=df, inference_features=INFERENCE_FEATURES, p_col='pihat',
treatment_col=TREATMENT_COL, outcome_col=OUTCOME_COL, learner=learner_x)
# Here for Selection Bias method will use default one-sided confounding function and alpha (quantile range of outcome values) input
sens_sumary_x = sens_x.sensitivity_analysis(methods=['Placebo Treatment',
'Random Cause',
'Subset Data',
'Random Replace',
'Selection Bias'], sample_size=0.5)
[10]:
# From the following results, refutation methods show our model is pretty robust;
# When alpha > 0, the treated group always has higher mean potential outcomes than the control; when alpha < 0, the control group is better off.
sens_sumary_x
# Plot the results by confounding vector and plot Confidence Intervals for ATE
sens_x_bias_alignment.plot(lls_x_bias_alignment, ci=True)
[17]:
# Plot the results by rsquare with partial r-square results by each individual features
sens_x_bias_alignment.plot(lls_x_bias_alignment, partial_rsqs_x_bias_alignment, type='r.squared', partial_rsqs=True)
# Plot the results by confounding vector and plot Confidence Intervals for ATE
sens_x_bias_alignment_new.plot(lls_x_bias_alignment_new, ci=True)
[28]:
# Plot the results by rsquare with partial r-square results by each individual features
sens_x_bias_alignment_new.plot(lls_x_bias_alignment_new, partial_rsqs_x_bias_alignment_new, type='r.squared', partial_rsqs=True)
df_new_2 = df.copy()
df_new_2['treated_new'] = df['feature_0'].rank()
df_new_2['treated_new'] = [1 if i > df_new_2.shape[0]/2 else 0 for i in df_new_2['treated_new']]
[30]:
df_new_2.head()
[30]:
   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  target  outcome   pihat  treated_new
0     0.9536     0.2911     0.0432     0.8720     0.5190     0.0822       1   2.0220  0.7657            1
1     0.2390     0.3096     0.5115     0.2048     0.8914     0.5015       0  -0.0732  0.2304            0
2     0.1091     0.0765     0.7428     0.6951     0.4580     0.7800       0  -1.4947  0.1000            0
3     0.2055     0.3967     0.6278     0.2086     0.3865     0.8860       0   0.6458  0.2533            0
4     0.4501     0.0578     0.3972     0.4100     0.5760     0.4764       0  -0.0018  0.1000            0
Sensitivity Analysis Summary Report (with One-sided confounding function and default alpha)
[31]:
sens_x_new_2 = Sensitivity(df=df_new_2, inference_features=INFERENCE_FEATURES, p_col='pihat',
treatment_col='treated_new', outcome_col=OUTCOME_COL, learner=learner_x)
# Here for Selection Bias method will use default one-sided confounding function and alpha (quantile range of outcome values) input
sens_sumary_x_new_2 = sens_x_new_2.sensitivity_analysis(methods=['Placebo Treatment',
'Random Cause',
'Subset Data',
'Random Replace',
'Selection Bias'], sample_size=0.5)
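The plotting cells below reference sens_x_bias_alignment_new_2. A sketch of the omitted cell that creates it, mirroring the earlier alignment-confounding setup; the causalsens() call returning the fit results and partial R-squared values is assumed from the earlier notebook cells:

sens_x_bias_alignment_new_2 = SensitivitySelectionBias(df_new_2, INFERENCE_FEATURES, p_col='pihat',
                                                       treatment_col='treated_new', outcome_col=OUTCOME_COL,
                                                       learner=learner_x, confound='alignment',
                                                       alpha_range=None)
lls_x_bias_alignment_new_2, partial_rsqs_x_bias_alignment_new_2 = sens_x_bias_alignment_new_2.causalsens()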
# Plot the results by confounding vector and plot Confidence Intervals for ATE
sens_x_bias_alignment_new_2.plot(lls_x_bias_alignment_new_2, ci=True)
[39]:
# Plot the results by rsquare with partial r-square results by each individual features
sens_x_bias_alignment_new_2.plot(lls_x_bias_alignment_new, partial_rsqs_x_bias_alignment_new_2, type='r.squared', partial_rsqs=True)
Unit Selection Based on Counterfactual Logic by Li and Pearl (2019)
Causal ML contains an experimental version of the counterfactual unit selection method proposed by Li and Pearl (2019). The method has not been extensively tested or optimised so the user should proceed with caution. This notebook demonstrates the basic use of the counterfactual unit selector.
[2]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from causalml.dataset import make_uplift_classification
from causalml.optimize import CounterfactualUnitSelector
from causalml.optimize import get_treatment_costs
from causalml.optimize import get_actual_value
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline
In the context of a simple two-armed experiment, the counterfactual unit selection approach considers the following four segments of individuals:
Never-takers: those who will not convert whether or not they are in the treatment
Always-takers: those who will convert whether or not they are in the treatment
Compliers: those who will convert if they are in the treatment and will not convert if they are in the control
Defiers: those who will convert if they are in the control and will not convert if they are in the treatment
If we assume that the payoff from conversion is $20 and the conversion cost of a treatment is $2.5, then we can calculate the payoffs for targeting each type of individual as follows. For never-takers, the payoff is always $0 because they will not convert or use a promotion. For always-takers, the payoff is -$2.5 because they would convert anyway but we additionally give them a treatment worth $2.5. For compliers, the payoff is the benefit from conversion minus the cost of the treatment ($17.5), and for defiers the payoff is -$20 because treating them loses a conversion that would otherwise have happened.
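The sketch below simply records these payoffs as constants for later reuse; the variable names are illustrative and are not part of the CausalML API.

# Per-segment payoffs implied by a $20 conversion value and a $2.5 treatment cost
# (illustrative names, not CausalML identifiers).
conversion_value = 20.0
treatment_cost = 2.5

nevertaker_payoff = 0.0                              # never converts, never uses the promotion
alwaystaker_payoff = -treatment_cost                 # converts anyway, we still pay for the treatment
complier_payoff = conversion_value - treatment_cost  # conversion happens only because of the treatment
defier_payoff = -conversion_value                    # treating them loses a conversion worth $20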
In this section we run the CounterfactualUnitSelector model and compare its performance against random assignment and a scheme in which all units are assigned to the treatment that has the best conversion in the training set. We measure the performance by looking at the average actual value payoff from those units in the testing set who happen to be in the treatment group recommended by each approach.
[7]:
# Specify the same costs as above but in a different form
tc_dict = {'control': 0, 'treatment': 2.5}
ic_dict = {'control': 0, 'treatment': 0}
conversion_value = np.full(df.shape[0], 20)
# Use the above information to get the cost of each treatment
cc_array, ic_array, conditions = get_treatment_costs(
treatment=df['treatment_group_key'], control_name='control',
cc_dict=tc_dict, ic_dict=ic_dict)
# Get the actual value of having a unit in their actual treatment
actual_value = get_actual_value(treatment=df['treatment_group_key'],
observed_outcome=df['conversion'],
conversion_value=conversion_value,
conditions=conditions,
conversion_cost=cc_array,
impression_cost=ic_array)
# Get the outcome if treatments were allocated randomly
random_allocation_value = actual_value.loc[test_idx].mean()
# Get the actual value of those individuals who are in the best
# treatment group
best_ate = df_train.groupby(
'treatment_group_key')['conversion'].mean().idxmax()
actual_is_best_ate = df_test['treatment_group_key'] == best_ate
best_ate_value = actual_value.loc[test_idx][actual_is_best_ate].mean()
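A hedged sketch of fitting the selector itself with these payoffs is shown next; it assumes CounterfactualUnitSelector takes a classifier plus the four segment payoffs and exposes fit/predict on a DataFrame, so the exact argument names should be verified against the installed version.

# Hedged sketch: fit the counterfactual unit selector and keep the units whose
# observed assignment matches its recommendation (argument names are assumptions).
cus = CounterfactualUnitSelector(learner=LogisticRegressionCV(),
                                 nevertaker_payoff=nevertaker_payoff,
                                 alwaystaker_payoff=alwaystaker_payoff,
                                 complier_payoff=complier_payoff,
                                 defier_payoff=defier_payoff)
cus.fit(data=df_train.drop(columns=['treatment_effect'], errors='ignore'),
        treatment='treatment_group_key',
        outcome='conversion')
cus_pred = cus.predict(data=df_test.drop(columns=['treatment_effect'], errors='ignore'),
                       treatment='treatment_group_key',
                       outcome='conversion')

# Illustrative decision rule: treat where the predicted payoff of treating is positive
cus_best = np.where(np.array(cus_pred).ravel() > 0, 'treatment', 'control')
actual_is_cus_best = df_test['treatment_group_key'].values == cus_best
cus_value = actual_value.loc[test_idx][actual_is_cus_best].mean()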
The goal in uplift modeling is usually to predict the best treatment condition for an individual. Most of the time, the best treatment condition is assumed to be the one that has the highest probability of some “conversion event” such as the individual’s purchasing a product. This is the traditional approach in which the goal is to maximize conversion.
However, if the goal of uplift modeling is to maximize value, then it is not safe to assume that the best treatment group is the one with the highest expected conversion. For example, it might be that the payoff from conversion is not sufficient to offset the cost of the treatment, or it might be that the treatment targets individuals who would convert anyway (Li and Pearl 2019). Therefore, it is often important to conduct some kind of value
optimization together with uplift modeling, in order to determine the treatment group with the best value, not just the best lift.
The Causal ML package includes the CounterfactualValueEstimator class to conduct simple imputation-based value optimization. This notebook demonstrates the use of CounterfactualValueEstimator to determine the best treatment group when the costs of treatments are taken into account. We consider two kinds of costs:
Conversion costs are those that we incur if an individual who is in the treatment group converts. A typical example would be the cost of a promotional voucher.
Impression costs are those that we need to pay for each individual in the treatment group irrespective of whether they convert. A typical example would be the cost associated with sending an SMS or email.
The proposed method takes two inputs: the CATE estimate \(\hat{\tau}\) learned by any suitable method, and the predicted outcome for an individual learned by what we call the conversion probability model, which estimates the conditional probability of conversion \(P(Y=1 \mid X=x, W=w)\), where \(W\) is the treatment group indicator. That is, the model estimates the probability of conversion for each individual using their observed pre-treatment features \(X\). The output of this model is then combined with the predicted CATE in order to impute the expected conversion probability for each individual under each treatment condition.
The fact that we impute the conversion probability under each experimental condition, the actual as well as the counterfactual, gives our method its name. Using the estimated conversion probabilities, we then compute the expected payoff under each treatment condition, taking into account the value of conversion as well as the conversion and impression costs \(cc_t\) and \(ic_t\) associated with each treatment \(t\) (see Zhao and Harinen (2019) for more details).
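As a rough, purely numerical illustration of this logic (not of the estimator's internals), the imputation and payoff scoring for a single unit could look like the toy example below; all numbers are hypothetical and the cost figures simply mirror the ones used later in this notebook.

# Toy illustration with hypothetical numbers: impute conversion probabilities under
# each arm for one unit observed under 'treatment1', then score each arm's payoff.
y_proba_observed = 0.30                          # conversion model output for the observed arm
cate = {'treatment1': 0.05, 'treatment2': 0.08}  # CATE estimates vs. control from any learner

p_control = y_proba_observed - cate['treatment1']        # strip out the observed arm's effect
p_arm = {'control': p_control,
         'treatment1': p_control + cate['treatment1'],
         'treatment2': p_control + cate['treatment2']}

conversion_value = 20.0
cc = {'control': 0.0, 'treatment1': 2.5, 'treatment2': 5.0}   # conversion costs cc_t
ic = {'control': 0.0, 'treatment1': 0.0, 'treatment2': 0.02}  # impression costs ic_t

# One plausible form of the expected payoff: (value - cc_t) * p_t - ic_t
payoff = {arm: (conversion_value - cc[arm]) * p - ic[arm] for arm, p in p_arm.items()}
best_arm = max(payoff, key=payoff.get)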
[2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb
from causalml.dataset import make_uplift_classification
from causalml.inference.meta import BaseTClassifier
from causalml.optimize import CounterfactualValueEstimator
from causalml.optimize import get_treatment_costs
from causalml.optimize import get_actual_value
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
In this example, we assume there are no costs associated with assigning units to the control group, and that for the two treatment groups the conversion costs are \$2.5 and \$5, respectively. We assume the impression costs to be zero for one of the treatments and \$0.02 for the other. We also specify the payoff from conversion, which we assume here to be the same for everyone: \$20. In practice, these values could vary from individual to individual.
[4]:
# Put costs into dicts
conversion_cost_dict = {'control': 0, 'treatment1': 2.5, 'treatment2': 5}
impression_cost_dict = {'control': 0, 'treatment1': 0, 'treatment2': 0.02}
# Use a helper function to put the treatment costs into arrays
cc_array, ic_array, conditions = get_treatment_costs(treatment=df['treatment_group_key'],
control_name='control',
cc_dict=conversion_cost_dict,
ic_dict=impression_cost_dict)
# Put the conversion value into an array
conversion_value_array = np.full(df.shape[0], 20)
Next, we calculate the value of having an individual in their actual treatment group, using the expected value under a treatment, i.e.:
# Use a helper function to obtain the value of actual treatment
actual_value = get_actual_value(treatment=df['treatment_group_key'],
observed_outcome=df['conversion'],
conversion_value=conversion_value_array,
conditions=conditions,
conversion_cost=cc_array,
impression_cost=ic_array)
A common problem in the uplift modeling literature is evaluating the quality of the treatment recommendations produced by a model. The evaluation of uplift models is tricky because we do not directly observe treatment effects at the individual level in non-simulated data, so it is not possible to use standard model evaluation metrics such as mean squared error. Consequently, various authors have proposed ways to work around this issue; for example, Schuler et al. (2018) identify seven different evaluation strategies used in the literature.
Below, we use the approach to model evaluation put forward by Kaepelner et al. (2014). The idea in this method is to evaluate the improvement we would gain if we targeted some as-yet untreated future population using the recommendations produced by a particular model. To do so, we split the data into disjoint training and testing sets and train our model on the training data. We then use the model to predict the best treatment group for units in the testing data, which in a simple two-arm trial is either treatment or control. To estimate the outcome for the future population if the model were used, we then select the subset of the testing data whose observed treatment allocation happens to be the same as the one recommended by the model. This population is called "lucky".
Predicted best treatment    Actual treatment    Lucky
Control                     Control             Yes
Control                     Treatment           No
Treatment                   Treatment           Yes
Treatment                   Control             No
The average outcome of the "lucky" population can be taken to represent the outcome for a future untreated population if we were to use the uplift model in question to allocate treatments. Recall that in all of these experiments the treatments are assumed to have been allocated randomly across the total population, so there should be no selection bias. The average outcome under a given model can then be compared with alternative treatment allocation strategies. As Kaepelner et al. (2014) point out, two common strategies are random allocation and "best treatment" allocation. To estimate the outcome for a future population under random allocation, we can simply look at the sample mean across the total test population. To estimate the same for the "best treatment" assignment, we can look at those units in the test set whose observed treatment assignment corresponds to the treatment group with the best average treatment effect. These alternative targeting strategies are interesting benchmarks because they are common practice in industry applications and elsewhere.
In this section, we compare four different targeting strategies:
Random treatment allocation under which all units in the testing set are randomly assigned to treatments
The “best treatment” allocation under which all units in the testing set are assigned to the treatment with the best conversion in the training set
Allocation under an uplift model in which all units in the testing set are assigned to the treatment which is predicted to have the highest conversion rate according to an uplift model trained on the training set
Allocation under the counterfactual value estimator model in which all units are assigned to the treatment group with the best predicted payoff
# Calculate the benchmark value according to the random allocation
# and best treatment schemes
random_allocation_value = actual_value.loc[test_idx].mean()
best_ate = df_train.groupby(
'treatment_group_key')['conversion'].mean().idxmax()
actual_is_best_ate = df_test['treatment_group_key'] == best_ate
best_ate_value = actual_value.loc[test_idx][actual_is_best_ate].mean()
[9]:
# Calculate the value under an uplift model
tm = BaseTClassifier(control_learner=xgb.XGBClassifier(),
treatment_learner=xgb.XGBClassifier(),
control_name='control')
tm.fit(df_train[X_names].values,
df_train['treatment_group_key'],
df_train['conversion'])
tm_pred = tm.predict(df_test[X_names].values)
pred_df = pd.DataFrame(tm_pred, columns=tm._classes)
tm_best = pred_df.idxmax(axis=1)
actual_is_tm_best = df_test['treatment_group_key'] == tm_best.ravel()
tm_value = actual_value.loc[test_idx][actual_is_tm_best].mean()
[10]:
# Estimate the conditional mean model; this is a pure curve
# fitting exercise
proba_model = xgb.XGBClassifier()
W_dummies = pd.get_dummies(df['treatment_group_key'])
XW = np.c_[df[X_names], W_dummies]
proba_model.fit(XW[train_idx], df_train['conversion'])
y_proba = proba_model.predict_proba(XW[test_idx])[:, 1]
[11]:
# Run the counterfactual calculation with TwoModel prediction
cve = CounterfactualValueEstimator(treatment=df_test['treatment_group_key'],
control_name='control',
treatment_names=conditions[1:],
y_proba=y_proba,
cate=tm_pred,
value=conversion_value_array[test_idx],
conversion_cost=cc_array[test_idx],
impression_cost=ic_array[test_idx])
cve_best_idx = cve.predict_best()
cve_best = [conditions[idx] for idx in cve_best_idx]
actual_is_cve_best = df.loc[test_idx, 'treatment_group_key'] == cve_best
cve_value = actual_value.loc[test_idx][actual_is_cve_best].mean()
[12]:
labels = [
'Random allocation',
'Best treatment',
'T-Learner',
'CounterfactualValueEstimator'
]
values = [
random_allocation_value,
best_ate_value,
tm_value,
cve_value
]
plt.bar(labels, values)
plt.ylabel('Mean actual value in testing set')
plt.xticks(rotation=45)
plt.show()
Here, only CounterfactualValueEstimator improves upon random targeting. The “best treatment” and T-Learner approaches likely perform worse because they recommend costly treatments to individuals who would convert anyway.
Feature Selection for Uplift Trees by Zhao et al. (2020)
This notebook includes two sections:
- Feature selection: demonstrate how to use Filter methods to select the most important numeric features
- Performance evaluation: evaluate the AUUC performance of models trained on the top-feature datasets
[9]:
df.head()
[9]:
  treatment_group_key  x1_informative  x2_informative  x3_informative  x4_informative  \
0          treatment1       -4.004496       -1.250351       -2.800557       -0.368288
1          treatment1       -3.170028       -0.135293        1.484246       -2.131584
2          treatment1       -0.763386       -0.785612        1.218781       -0.725835
3             control        0.887727        0.049095       -2.242776        1.530966
4             control       -1.672922       -1.156145        3.871476       -1.883713

   x5_informative  x6_informative  x7_informative  x8_informative  x9_informative  ...  \
0       -0.115549       -2.492826        0.369516        0.290526        0.465153  ...
1       -0.760103        1.764765        0.972124        1.407131       -1.027603  ...
2        1.044489       -1.521071       -2.266684       -1.614818       -0.113647  ...
3        0.392623       -0.203071       -0.549329        0.107296       -0.542277  ...
4       -0.220122       -4.615669        0.141980       -0.933756       -0.765592  ...

   x56_uplift_increase  x57_uplift_increase  x58_uplift_increase  x59_increase_mix  \
0             0.496144             1.847680            -0.337894         -0.672058
1             0.574955             3.578138             0.678118         -0.545227
2             0.985076             1.079181             0.578092          0.574370
3            -0.175352             0.683330             0.567545          0.349622
4             0.485798            -0.355315             0.982488         -0.007260

   x60_uplift_decrease  x61_uplift_decrease  x62_uplift_decrease  x63_uplift_decrease  conversion  treatment_effect
0             1.180352             0.778013             0.931000             2.947160           0                 0
1            -0.143942            -0.015188             1.189643             1.943692           1                 0
2            -0.477429             0.679070             1.650897             2.768897           1                 0
3            -0.789203             2.315184             0.658607             1.734836           0                 0
4             2.895155             0.261848            -1.337001            -0.639983           1                 0

5 rows × 66 columns
[10]:
# Look at the conversion rate and sample size in each group
df.pivot_table(values='conversion',
index='treatment_group_key',
aggfunc=[np.mean, np.size],
margins=True)
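The feature lists used below (top_10_features, top_15_features, top_20_features) come from a filter-method ranking. A hedged sketch of producing them with CausalML's filter feature-selection utilities follows; the import path and the get_importance signature are assumptions to verify against the installed version.

# Hedged sketch: rank features with the F filter method and keep the top k.
from causalml.feature_selection import FilterSelect  # import path is an assumption

filter_method = FilterSelect()
f_imp = filter_method.get_importance(df_train, X_names, y_name, method='F',
                                     experiment_group_column='treatment_group_key',
                                     control_group='control')

# Assume the returned frame has a 'feature' column sorted by importance
ranked_features = f_imp['feature'].tolist()
top_10_features = ranked_features[:10]
top_15_features = ranked_features[:15]
top_20_features = ranked_features[:20]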
# using all features
features = X_names
uplift_model.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds = uplift_model.predict(df_test[features].values)
# using top 10 features
features = top_10_features
uplift_model.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds_t10 = uplift_model.predict(df_test[features].values)
[28]:
# using top 15 features
features = top_15_features
uplift_model.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds_t15 = uplift_model.predict(df_test[features].values)
[29]:
# using top 20 features
features = top_20_features
uplift_model.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds_t20 = uplift_model.predict(df_test[features].values)
# using all features
features = X_names
r_rf_learner.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds = r_rf_learner.predict(df_test[features].values)
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment1 with R-loss
[34]:
# using top 10 features
features = top_10_features
r_rf_learner.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds_t10 = r_rf_learner.predict(df_test[features].values)
[35]:
# using top 15 features
features = top_15_features
r_rf_learner.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds_t15 = r_rf_learner.predict(df_test[features].values)
[36]:
# using top 20 features
features = top_20_features
r_rf_learner.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds_t20 = r_rf_learner.predict(df_test[features].values)
# using all features
features = X_names
slearner_rf.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds = slearner_rf.predict(df_test[features].values)
[41]:
# using top 10 features
features = top_10_features
slearner_rf.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds_t10 = slearner_rf.predict(df_test[features].values)
[42]:
# using top 15 features
features = top_15_features
slearner_rf.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds_t15 = slearner_rf.predict(df_test[features].values)
[43]:
# using top 20 features
features = top_20_features
slearner_rf.fit(X = df_train[features].values,
treatment = df_train['treatment_group_key'].values,
y = df_train[y_name].values)
y_preds_t20 = slearner_rf.predict(df_test[features].values)
# print out AUUC score
auuc_score(df_preds, outcome_col=y_name, treatment_col='is_treated')
[45]:
All 0.824483
Top 10 0.832872
Top 15 0.817835
Top 20 0.816149
Random 0.506801
dtype: float64
In this notebook, we demonstrated how the Filter method functions can select important features and improve AUUC performance, although the results may vary across datasets, models, and hyper-parameters.
Policy Learner by Athey and Wager (2018) with Binary Treatment
This notebook demonstrates the use of the CausalML implementation of the policy learner by Athey and Wager (2018) (https://arxiv.org/abs/1702.02896).
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
[3]:
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
[4]:
from causalml.optimize import PolicyLearner
from sklearn.tree import plot_tree
from lightgbm import LGBMRegressor
from causalml.inference.meta import BaseXRegressor
First we generate a synthetic data set with a binary treatment. The treatment is randomly assigned conditional on covariates. The treatment effect is heterogeneous and is negative for some individuals. We use a policy learner to classify individuals into treat/no-treat groups so as to maximize the total treatment effect.
[5]:
np.random.seed(1234)
n = 10000
p = 10
X = np.random.normal(size=(n, p))
ee = 1 / (1 + np.exp(X[:, 2]))
tt = 1 / (1 + np.exp(X[:, 0] + X[:, 1])/2) - 0.5
W = np.random.binomial(1, ee, n)
Y = X[:, 2] + W * tt + np.random.normal(size=n)
Use policy learner with default outcome/treatment estimator and a simple policy classifier.
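A minimal sketch of that step, assuming PolicyLearner accepts a scikit-learn classifier via policy_learner and exposes fit(X, treatment, y) and predict(X) (the argument names and return conventions are assumptions):

# Hedged sketch: fit the policy learner and inspect the learned treatment policy.
policy_learner = PolicyLearner(policy_learner=DecisionTreeClassifier(max_depth=2),
                               calibration=True)
policy_learner.fit(X, W, Y)

# predict() is assumed to return a 0/1 treat/no-treat recommendation per unit
pi_hat = policy_learner.predict(X)
print('share of units the policy would treat:', pi_hat.mean())
print('mean effect under the policy vs. treating everyone:',
      (pi_hat * tt).mean(), 'vs.', tt.mean())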
CEVAE vs. Meta-Learners Benchmark with IHDP + Synthetic Datasets
[1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import torch
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error as mse
from scipy.stats import entropy
import warnings
import logging
from causalml.inference.meta import BaseXRegressor, BaseRRegressor, BaseSRegressor, BaseTRegressor
from causalml.inference.nn import CEVAE
from causalml.propensity import ElasticNetPropensityModel
from causalml.metrics import *
from causalml.dataset import simulate_hidden_confounder
%matplotlib inline
warnings.filterwarnings('ignore')
logger = logging.getLogger('causalml')
logger.setLevel(logging.DEBUG)
plt.style.use('fivethirtyeight')
sns.set_palette('Paired')
plt.rcParams['figure.figsize'] = (12,8)
Hill (2011) introduced a semi-synthetic dataset constructed from the Infant Health and Development Program (IHDP). This dataset is based on a randomized experiment investigating the effect of home visits by specialists on future cognitive scores. The IHDP simulation is considered the de facto standard benchmark for neural-network treatment effect estimation methods.
[2]:
# load all IHDP data
df = pd.DataFrame()
for i in range(1, 10):
    data = pd.read_csv('./data/ihdp_npci_' + str(i) + '.csv', header=None)
    df = pd.concat([data, df])
cols = ["treatment", "y_factual", "y_cfactual", "mu0", "mu1"] + [i for i in range(25)]
df.columns = cols
print(df.shape)
# replicate the data 100 times
replications = 100
df = pd.concat([df]*replications, ignore_index=True)
print(df.shape)
(6723, 30)
(672300, 30)
[3]:
# set which features are binary
binfeats = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
# set which features are continuous
contfeats = [i for i in range(25) if i not in binfeats]
# reorder features with binary first and continuous after
perm = binfeats + contfeats
[4]:
df = df.reset_index(drop=True)
df.head()
[4]:
   treatment  y_factual  y_cfactual        mu0        mu1         0         1         2         3         4  ...  \
0          1  49.647921   34.950762  37.173291  50.383798 -0.528603 -0.343455  1.128554  0.161703 -0.316603  ...
1          0  16.073412   49.435313  16.087249  49.546234 -1.736945 -1.802002  0.383828  2.244320 -0.629189  ...
2          0  19.643007   48.598210  18.044855  49.661068 -0.807451 -0.202946 -0.360898 -0.879606  0.808706  ...
3          0  26.368322   49.715204  24.605964  49.971196  0.390083  0.596582 -1.850350 -0.879606 -0.004017  ...
4          0  20.258893   51.147418  20.612816  49.794120 -1.045229 -0.602710  0.011465  0.161703  0.683672  ...

   15  16  17  18  19  20  21  22  23  24
0   1   1   1   1   0   0   0   0   0   0
1   1   1   1   1   0   0   0   0   0   0
2   1   0   1   1   0   0   0   0   0   0
3   1   0   1   1   0   0   0   0   0   0
4   1   1   1   1   0   0   0   0   0   0

5 rows × 30 columns
[5]:
X = df[perm].values
treatment = df['treatment'].values
y = df['y_factual'].values
y_cf = df['y_cfactual'].values
tau = df.apply(lambda d: d['y_factual'] - d['y_cfactual'] if d['treatment']==1
else d['y_cfactual'] - d['y_factual'],
axis=1)
mu_0 = df['mu0'].values
mu_1 = df['mu1'].values
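A minimal sketch of fitting the CEVAE wrapper on these arrays, assuming it accepts torch tensors in fit and predict and that the constructor arguments below exist (they are assumptions and may differ across versions):

# Hedged sketch: train CEVAE on the IHDP covariates and predict treatment effects.
X_tr, X_te, t_tr, t_te, y_tr, y_te = train_test_split(X, treatment, y,
                                                      test_size=0.2, random_state=42)

cevae = CEVAE(outcome_dist='studentt', latent_dim=20, hidden_dim=200,
              num_epochs=5, batch_size=1000, learning_rate=1e-3)
cevae.fit(X=torch.tensor(X_tr, dtype=torch.float),
          treatment=torch.tensor(t_tr, dtype=torch.float),
          y=torch.tensor(y_tr, dtype=torch.float))

# Predicted individual treatment effects on the held-out split
ite_hat = cevae.predict(torch.tensor(X_te, dtype=torch.float))
print('estimated ATE:', float(np.mean(ite_hat)))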
DR Learner vs. DR-IV Learner vs. X-Learner Benchmark with Synthetic Data
This notebook demonstrates the use of the CausalML implementation of the DR Learner by Kennedy (2020) (https://arxiv.org/abs/2004.14497) for Individual Treatment Effect (ITE) estimation.
[2]:
%load_ext autoreload
%autoreload 2
[3]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from xgboost import XGBRegressor
import warnings
from causalml.inference.meta import BaseXRegressor, BaseDRRegressor
from causalml.inference.iv import BaseDRIVRegressor
from causalml.dataset import synthetic_data
from causalml.metrics import *
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')
%matplotlib inline
We use a flexible ML estimator to estimate the outcome model but a simple linear regression model to estimate the ITE, since the ITE estimate is often noisy and prone to overfitting with a flexible estimator.
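A hedged sketch of that setup, assuming BaseDRRegressor takes the base and effect models via learner and treatment_effect_learner (the argument names are assumptions):

# Hedged sketch: flexible outcome model, simple linear model for the ITE.
y, X, treatment, tau, b, e = synthetic_data(mode=1, n=10000, p=8, sigma=1.0)

dr_learner = BaseDRRegressor(learner=XGBRegressor(),
                             treatment_effect_learner=LinearRegression())
dr_cate = dr_learner.fit_predict(X, treatment, y)

x_learner = BaseXRegressor(learner=XGBRegressor())
x_cate = x_learner.fit_predict(X, treatment, y)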
The DR Learner outperforms the X Learner on this dataset. Even with a built-in mechanism to counteract the imbalance between the treatment and control samples, the X Learner still struggles in this regime, where the treatment probability is close to 1.
Now we tweak the previous synthetic data generation with the following two changes:
- We add a random assignment mechanism. Only assigned units may potentially receive the treatment, though whether a unit in the assigned group actually receives it depends on its confounding variables. This is therefore a situation of one-sided non-compliance.
- One of the confounding variables that affects both the propensity to receive treatment and the treatment effect is not observed by the analyst. This is therefore a problem with a hidden confounder.
[8]:
n = 10000
p = 8
sigma = 1.0
X = np.random.uniform(size=n*p).reshape((n, -1))
b = np.sin(np.pi * X[:, 0] * X[:, 1]) + 2 * (X[:, 2] - 0.5) ** 2 + X[:, 3] + 0.5 * X[:, 4]
assignment = (np.random.uniform(size=10000)>0.5).astype(int)
eta = 0.1
e = np.maximum(np.repeat(eta, n), np.minimum(np.sin(np.pi * X[:, 0] * X[:, 1]), np.repeat(1-eta, n)))
e[assignment == 0] = 0
tau = (X[:, 0] + X[:, 1]) / 2
X_obs = X[:, [i for i in range(8) if i!=1]]
w = np.random.binomial(1, e, size=n)
treatment = w
y = b + (w - 0.5) * tau + sigma * np.random.normal(size=n)
Comparing X Learner, DR Learner, and DRIV Learner
We use 3 learners, X Learner, DR Learner, and DRIV Learner, to estimate the ITE of the compliers, i.e. those who only receive treatment when they are assigned.
We continue to see that the X Learner generates a noisier ITE estimate than the DR Learner, and both have an upward bias. The DRIV Learner, however, is able to alleviate the bias significantly.
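A hedged sketch of the DRIV step for the compliers, assuming BaseDRIVRegressor.fit_predict accepts the assignment indicator alongside the realized treatment (the keyword names are assumptions):

# Hedged sketch: DRIV learner using assignment as the instrument for the realized treatment.
driv_learner = BaseDRIVRegressor(learner=XGBRegressor(),
                                 treatment_effect_learner=LinearRegression())
driv_cate = driv_learner.fit_predict(X=X_obs, assignment=assignment,
                                     treatment=treatment, y=y)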
Meta-Learner Benchmarks with Synthetic Data in Nie and Wager (2020)
This notebook compares X-, R-, T- and S-learners across the simulation setups discussed by Nie and Wager (2020). Note that the experiments don’t include the parameter tuning described in the paper.
[1]:
import numpy as np
import pandas as pd
from causalml.inference.meta import BaseSRegressor
from causalml.inference.meta import BaseTRegressor
from causalml.inference.meta import BaseXRegressor
from causalml.inference.meta import BaseRRegressor
from causalml.dataset import synthetic_data
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression, Lasso
from xgboost import XGBRegressor
from copy import deepcopy
from itertools import product
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
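The table df_res_xgb plotted below holds one PEHE value (the MSE between estimated and true CATE) per learner, replication, and simulation setup. A hedged sketch of assembling it, with m_list indexing the simulation modes of synthetic_data and data_generation_descs holding short labels for them (both names mirror the plotting code; the mode descriptions are assumptions), is:

# Hedged sketch: build a PEHE table across simulation modes and meta-learners.
m_list = [1, 2, 3, 4]
data_generation_descs = {1: 'difficult nuisance, easy effect',   # labels are assumptions
                         2: 'randomized trial',
                         3: 'easy propensity, difficult baseline',
                         4: 'unrelated treatment and control'}
rows = []
for m in m_list:
    for rep in range(10):
        y, X, w, tau, b, e = synthetic_data(mode=m, n=2000, p=10, sigma=1.0)
        learners = {'S': BaseSRegressor(XGBRegressor()),
                    'T': BaseTRegressor(XGBRegressor()),
                    'X': BaseXRegressor(XGBRegressor()),
                    'R': BaseRRegressor(XGBRegressor())}
        for name, learner in learners.items():
            cate_hat = learner.fit_predict(X=X, treatment=w, y=y).flatten()
            rows.append({'sim_mode': m, 'learner': name,
                         'pehe': mean_squared_error(tau, cate_hat)})
df_res_xgb = pd.DataFrame(rows)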
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
axs = axs.ravel()
for i, m in zip(range(4), m_list):
    sns.boxplot(x='learner', y='pehe', data=df_res_xgb.loc[df_res_xgb['sim_mode'] == m], linewidth=1, showfliers=False, ax=axs[i])
    axs[i].title.set_text(data_generation_descs[m] + ' (XGB)')
    axs[i].set_ylabel('PEHE (MSE)')
    axs[i].set_xlabel('')  # Hack
    axs[i].tick_params(labelsize=18)
plt.tight_layout()
Causal Trees/Forests Treatment Effects Estimation and Tree Visualization
[1]:
import pandas as pd
import numpy as np
import multiprocessing as mp
from collections import defaultdict
np.random.seed(42)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
import causalml
from causalml.metrics import plot_gain, plot_qini, qini_score
from causalml.dataset import synthetic_data
from causalml.inference.tree import plot_dist_tree_leaves_values, get_tree_leaves_mask
from causalml.inference.meta import BaseSRegressor, BaseXRegressor, BaseTRegressor, BaseDRRegressor
from causalml.inference.tree import CausalRandomForestRegressor
from causalml.inference.tree import CausalTreeRegressor
from causalml.inference.tree.plot import plot_causal_tree
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_format = 'retina'
# Simulate randomized trial: mode=2
y, X, w, tau, b, e = synthetic_data(mode=2, n=15000, p=20, sigma=5.5)
df = pd.DataFrame(X)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df.columns = feature_names
df['outcome'] = y
df['treatment'] = w
df['treatment_effect'] = tau
[4]:
df.head()
[4]:
   feature_0  feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  \
0   0.496714  -0.138264   0.358450   1.523030  -0.234153  -0.234137   1.579213
1   1.465649  -0.225776   1.239872  -1.424748  -0.544383   0.110923  -1.150994
2   0.738467   0.171368   0.909835  -0.301104  -1.478522  -0.719844  -0.460639
3  -0.479174  -0.185659   0.000000  -1.196207   0.812526   1.356240  -0.072010
4  -0.219672   0.357113   0.137441  -0.518270  -0.808494  -0.501757   0.915402

   feature_7  feature_8  feature_9  ...  feature_13  feature_14  feature_15  \
0   0.767435  -0.469474   0.542560  ...   -1.913280   -1.724918   -0.562288
1   0.375698  -0.600639  -0.291694  ...   -1.057711    0.822545   -1.220844
2   1.057122   0.343618  -1.763040  ...    0.611676    1.031000    0.931280
3   1.003533   0.361636  -0.645120  ...    1.564644   -2.619745    0.821903
4   0.328751  -0.529760   0.513267  ...   -0.327662   -0.392108   -1.463515

   feature_16  feature_17  feature_18  feature_19    outcome  treatment  treatment_effect
0   -1.012831    0.314247   -0.908024   -1.412304   7.124356          1          1.123117
1    0.208864   -1.959670   -1.328186    0.196861 -11.263144          0          2.052266
2   -0.839218   -0.309212    0.331263    0.975545   0.269378          0          1.520964
3    0.087047   -0.299007    0.091761   -1.987569  -0.976893          0          0.125446
4    0.296120    0.261055    0.005113   -0.234587  -1.949163          1          0.667889

5 rows × 23 columns
[5]:
# Look at the mean outcome and sample size in each group
df.pivot_table(values='outcome',
index='treatment',
aggfunc=[np.mean, np.size],
margins=True)
# Split data to training and testing samples for model validation (next section)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=11101)
n_test = df_test.shape[0]
n_train = df_train.shape[0]
[8]:
# Table to gather estimated ITEs by models
df_result = pd.DataFrame({
'outcome': df_test['outcome'],
'is_treated': df_test['treatment'],
'treatment_effect': df_test['treatment_effect']
})
- standard_mse: the scikit-learn MSE criterion, where node values store \(E_{node_i}(Y|T=1)-E_{node_i}(Y|T=0)\), i.e. the treatment effects.
- causal_mse: the criterion of Athey and Imbens (2016), which rewards a partition for finding strong heterogeneity in treatment effects and penalizes a partition that creates variance in leaf estimates (https://www.pnas.org/doi/10.1073/pnas.1510489113).
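A minimal sketch of fitting a tree under each criterion, assuming CausalTreeRegressor exposes criterion and control_name arguments and a fit(X, treatment, y) interface (all of these are assumptions to verify against the installed version):

# Hedged sketch: fit one causal tree per splitting criterion and store test-set CATEs.
for criterion in ('standard_mse', 'causal_mse'):
    ctree = CausalTreeRegressor(criterion=criterion, control_name=0, min_samples_leaf=200)
    ctree.fit(X=df_train[feature_names].values,
              treatment=df_train['treatment'].values,
              y=df_train['outcome'].values)
    df_result[f'causal_tree_{criterion}'] = ctree.predict(df_test[feature_names].values)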
Causal Trees/Forests Interpretation with Feature Importance and SHAP Values
[1]:
import pandas as pd
import numpy as np
import multiprocessing as mp
np.random.seed(42)
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
import shap
import causalml
from causalml.metrics import plot_gain, plot_qini, qini_score
from causalml.dataset import synthetic_data
from causalml.inference.tree import plot_dist_tree_leaves_values, get_tree_leaves_mask
from causalml.inference.tree import CausalRandomForestRegressor, CausalTreeRegressor
from causalml.inference.tree.utils import timeit
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_format = 'retina'
[2]:
from importlib.metadata import version

for libname in ["causalml", "shap"]:
    print(f"{libname}: {version(libname)}")
causalml: 0.14.1
shap: 0.42.1
[3]:
# Simulate randomized trial: mode=2
y, X, w, tau, b, e = synthetic_data(mode=2, n=2000, p=10, sigma=3.0)
df = pd.DataFrame(X)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df.columns = feature_names
df['outcome'] = y
df['treatment'] = w
df['treatment_effect'] = tau
[4]:
# Split data to training and testing samples for model validation
df_train, df_test = train_test_split(df, test_size=0.2, random_state=111)
n_train, n_test = df_train.shape[0], df_test.shape[0]
X_train, y_train = df_train[feature_names], df_train['outcome'].values
X_test, y_test = df_test[feature_names], df_test['outcome'].values
treatment_train, treatment_test = df_train['treatment'].values, df_test['treatment'].values
effect_test = df_test['treatment_effect'].values
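The ctree and crforest models interpreted below can be constructed roughly as in the sketch here; the hyper-parameters are only illustrative and the constructor arguments are assumptions to check against the installed version.

# Hedged sketch: a single causal tree and a causal forest fit on the training split.
ctree = CausalTreeRegressor(control_name=0, min_samples_leaf=200)
ctree.fit(X=X_train.values, treatment=treatment_train, y=y_train)

crforest = CausalRandomForestRegressor(n_estimators=200, control_name=0,
                                       min_samples_leaf=200, n_jobs=mp.cpu_count() - 1)
crforest.fit(X=X_train.values, treatment=treatment_train, y=y_train)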
observation = X_test.loc[[0]]
tree_explainer = shap.TreeExplainer(ctree)
# Expected values for treatment=0 and treatment=1. i.e. Y|X,T=0 and Y|X,T=1
tree_explainer.expected_value
[10]:
array([0.93121212, 1.63459276])
[11]:
# Tree Explainer for treatment=0
shap.initjs()
shap_values = tree_explainer.shap_values(observation)
shap.force_plot(tree_explainer.expected_value[0],
shap_values[0],
observation)
[12]:
# Tree Explainer for treatment=1
tree_explainer = shap.TreeExplainer(ctree)
shap.initjs()
shap_values = tree_explainer.shap_values(observation)
shap.force_plot(tree_explainer.expected_value[1],
shap_values[1],
observation)
[13]:
# Tree Explainer for treatment=0
cforest_explainer = shap.TreeExplainer(crforest)
shap.initjs()
shap_values = cforest_explainer.shap_values(observation)
shap.force_plot(cforest_explainer.expected_value[0],
shap_values[0],
observation)
[14]:
# Tree Explainer for treatment=1
cforest_explainer = shap.TreeExplainer(crforest)
shap.initjs()
shap_values = cforest_explainer.shap_values(observation)
shap.force_plot(cforest_explainer.expected_value[1],
shap_values[1],
observation)
[15]:
for i in [0, 1]:
    print(f"If treatment = {i},i.e. Y_hat|X,T={i}")
    shap.dependence_plot("feature_0",
                         cforest_explainer.shap_values(X_test)[i],
                         X_test,
                         interaction_index="feature_2")
If treatment = 0,i.e. Y_hat|X,T=0
If treatment = 1,i.e. Y_hat|X,T=1
[16]:
# Sort the feature indexes by their importance in the model
# (sum of SHAP value magnitudes over the validation dataset)
for i in [0, 1]:
    print(f"If treatment = {i},i.e. Y_hat|X,T={i}")
    shap_values = cforest_explainer.shap_values(X_test)[i]
    top_inds = np.argsort(-np.sum(np.abs(shap_values), 0))
    # Make SHAP dependence plots of the four most important features
    for rank in range(4):
        shap.dependence_plot(top_inds[rank], shap_values, X_test)
If treatment = 0,i.e. Y_hat|X,T=0
If treatment = 1,i.e. Y_hat|X,T=1
[ ]:
Logistic Regression Based Data Generation Function for Uplift Classification Problem
This data generation function uses logistic regression as the underlying data generation model. It enables better control of feature patterns, i.e. how a feature is associated with the outcome baseline and the treatment effect. It supports 6 different patterns: Linear, Quadratic, Cubic, Relu, Sine, and Cosine.
This notebook shows how to use this data generation function to generate data, with a visualization of the feature patterns.
[1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
from causalml.dataset import make_uplift_classification_logistic
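The dataset df used in the cells below is produced by make_uplift_classification_logistic. A minimal sketch of the generation call is shown here for completeness; it assumes the function follows the same (df, feature_names) return convention as make_uplift_classification, and the parameter names and values used in the original notebook may differ (see the function's docstring for the full set of options).

# A sketch of the data generation step (not the notebook's exact call);
# all pattern-related parameters are left at their defaults.
df, feature_names = make_uplift_classification_logistic(n_samples=10000)
df.head()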
# Extract control and treatment1 for illustration
treatment_group_keys = ['control','treatment1']
y_name='conversion'
df1 = df[df['treatment_group_key'].isin(treatment_group_keys)].reset_index(drop=True)
df1.groupby(['treatment_group_key'])['conversion'].mean()
[51]:
treatment_group_key
control 0.09896
treatment1 0.15088
Name: conversion, dtype: float64
[53]:
color_dict = {'control': '#2471a3', 'treatment1': '#FF5733', 'treatment2': '#5D6D7E',
              'treatment3': '#34495E', 'treatment4': '#283747'}
hatch_dict = {'control': '', 'treatment1': '//'}
x_name_plot = ['x11_uplift', 'x12_uplift', 'x2_informative', 'x5_informative']
x_new_name_plot = ['Uplift Feature 1', 'Uplift Feature 2', 'Classification Feature 1', 'Classification Feature 2']
opacity = 0.8
plt.figure(figsize=(20, 3))
subplot_list = [141, 142, 143, 144]
counter = 0
bar_width = 0.9 / len(treatment_group_keys)
for x_name_i in x_name_plot:
    bins = np.percentile(df1[x_name_i].values, np.linspace(0, 100, 11))[:-1]
    df1['x_bin'] = np.digitize(df1[x_name_i].values, bins)
    df_gb = df1.groupby(['treatment_group_key', 'x_bin'], as_index=False)[y_name].mean()
    plt.subplot(subplot_list[counter])
    for ti in range(len(treatment_group_keys)):
        x_index = [ti * bar_width - len(treatment_group_keys) / 2 * bar_width + xi for xi in range(10)]
        plt.bar(x_index,
                df_gb[df_gb['treatment_group_key'] == treatment_group_keys[ti]][y_name].values,
                bar_width,
                alpha=opacity,
                color=color_dict[treatment_group_keys[ti]],
                hatch=hatch_dict[treatment_group_keys[ti]],
                label=treatment_group_keys[ti])
    plt.xticks(range(10), [int(xi + 10) for xi in np.linspace(0, 100, 11)[:-1]])
    plt.xlabel(x_new_name_plot[counter], fontsize=16)
    plt.ylabel('Conversion', fontsize=16)
    # plt.title(x_name_i)
    if counter == 0:
        plt.legend(treatment_group_keys, loc=2, fontsize=16)
    plt.ylim([0., 0.3])
    counter += 1
In the figure above, Uplift Feature 1 has a linear pattern on treatment effect, Uplift Feature 2 has a quadratic pattern on treatment effect, Classification Feature 1 has a quadratic pattern on baseline for both treatment and control, and Classification Feature 2 has a Sine pattern on baseline for both treatment and control.
This notebook shows approaches to evaluating multi-armed CATE estimators from CausalML with the Multi-Armed Qini metric available in the maq package (available at https://github.com/grf-labs/maq).
This metric is a generalization of the familiar Qini curve to settings where we have multiple treatment arms available, and the cost of assigning treatment can vary by both unit and treatment arm according to some known cost structure. At a high level, this metric essentially allows you to quantify the value of targeting with more treatment arms by undertaking a cost-benefit exercise that uses your CATE estimates to assign the arm to the unit that is most cost-beneficial at various budget
constraints.
This notebook gives a brief overview of the statistical setup and a walkthrough with a simple simulated example.
To use this functionality, you first have to install the maq Python package from GitHub. The latest source release can be installed with:
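pip install "git+https://github.com/grf-labs/maq.git#egg=maq&subdirectory=python-package"

(the URL fragment and subdirectory in the command above reflect the repository's typical layout; check the maq README for the authoritative install command)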
Let \(k = 1, \ldots K\) denote one of \(K\) mutually exclusive and costly treatment arms and \(k = 0\) a (costless) control arm. Let \(Y_i(k)\) denote the potential outcome in the \(k\)-th arm for unit i, and \(X_i\) a set of observable characteristics.
Let the function \(\hat \tau(\cdot)\) be an estimate of the conditional average treatment effect (CATE) obtained from a training set, where the \(k\)-th element estimates
\[\tau_k(X_i) = E[Y_i(k) - Y_i(0) \mid X_i].\]
The Qini curve \(Q(B)\) quantifies the value of assigning treatment in accordance with our estimated function \(\hat \tau(\cdot)\) over different values of a budget constraint \(B\). With a single treatment arm (\(K=1\)), we can formalize this as
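\[Q(B) = E\left[\pi_B(X_i)\left(Y_i(1) - Y_i(0)\right)\right],\]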
where \(\pi_B(X_i) \in \{0, 1\}\) is the policy that assigns treatment (=1) to those units predicted by \(\hat \tau(\cdot)\) to benefit the most such that on average we incur a cost of at most B. If we let \(C(\cdot)\) denote our known cost function (e.g. \(C(X_i) = 4.2\) means assigning the \(i\)-th unit the treatment costs 4.2 on some chosen cost denomination), then \(\pi_B\) is going to look like
While slightly daunting written down formally, it turns out expressing \(\pi_B\) is quite simple: it essentially reduces to a thresholding rule: for a given \(B\), treat the units where the predicted cost-to-benefit ratio \(\frac{\hat \tau(X_i)}{C(X_i)}\) is above a cutoff. The Qini curve can be used to quantify the value, as measured by the expected gain over assigning each unit the control arm when using the estimated function \(\hat \tau(\cdot)\) with cost structure
\(C(\cdot)\) to allocate treatment, as we vary the available budget \(B\).
With multiple treatment arms (\(K > 1\)), our object of interest, the Qini curve, is the same; we just need to add an inner product \(\langle \cdot,\cdot \rangle\) to the notation
to denote that \(\pi_B(X_i)\) now is a \(K\)-length selection vector with 1 in the \(k\)-th entry if we predict that it is optimal to assign the \(i\)-th unit that arm at the given budget constraint. Similarly to above, \(\pi_B\) takes the following form
Expressing \(\pi_B\) is more complicated now, as for each budget constraint \(B\), \(\pi_B\) has to make decisions of the form “should I assign the \(i\)-th unit an initial arm, or if the \(j\)-th unit had already been assigned an arm: should I upgrade this person to a costlier but more effective arm?”. It turns out that it is possible to express \(\pi_B\) as a thresholding rule (for details we refer to this paper), yielding
tractable ways to construct Qini curves for multi-armed treatment rules.
Generate some simple (synthetic) data with \(K=2\) treatment arms, where the second arm is more effective on average.
[3]:
# Imports assumed from earlier notebook cells
import numpy as np
from maq import MAQ, get_ipw_scores

n = 20000
p = 5
# Draw a treatment assignment from {0, 1, 2} uniformly
W = np.random.choice([0, 1, 2], n)
# Generate p observable characteristics where some are related to the CATE
X = np.random.rand(n, p)
Y = X[:, 1] + X[:, 2]*(W == 1) + 1.5*X[:, 3]*(W == 2) + np.random.randn(n)
# (in this example, the true arm 2 CATE is 1.5*X[:, 3])
# Generate a train/test split
n_train = 15000
ix = np.random.permutation(n)
train, test = ix[:n_train], ix[n_train:]
Obtain \(\hat \tau(\cdot)\) by fitting a CATE estimator on the training set (using an R-learner, for example).
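A minimal sketch of this step (not necessarily the notebook's exact code), using CausalML's BaseRRegressor with an LGBMRegressor base learner; the treatment labels are passed as strings with '0' as the control group, and the fitted model's predict later returns one CATE column per treatment arm:

from lightgbm import LGBMRegressor
from causalml.inference.meta import BaseRRegressor

# Fit a multi-arm R-learner on the training split;
# the fitted model is used as tau_function in the cells below
tau_function = BaseRRegressor(learner=LGBMRegressor(), control_name='0')
tau_function.fit(X[train], W[train].astype(str), Y[train])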
At a high level, there are two tasks associated with estimating a Qini curve \(Q(B)\). The first one is estimating the underlying policy \(\pi_B\), and the second is estimating the value of this policy.
As mentioned in the previous section, with multiple costly treatment arms, the policy \(\pi_B\) is more complicated to compute than in the single-armed case, since given a treatment effect function (obtained from some training set) and a cost structure, we need to figure out which arm to assign to which unit at every budget constraint. The maq package performs this first step with an algorithm that gives the path of multi-armed policies \(\pi_B\).
For the second step of estimating the value of this policy, we need to construct a matrix of suitable evaluation scores (that we denote by \(\Gamma\)) that have the property that when averaged they act as treatment effect estimates.
If we know the treatment randomization probabilities \(P[W_i=k~|~X_i]\), it turns out that constructing these scores is easy: we can simply use inverse-propensity weighting (IPW). With \(K\) treatment arms, the scores for the \(k\)-th arm then take the following form
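\[\Gamma_{i,k} = \frac{\mathbb{1}\{W_i = k\}\, Y_i}{P[W_i = k \mid X_i]} - \frac{\mathbb{1}\{W_i = 0\}\, Y_i}{P[W_i = 0 \mid X_i]},\]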
where \(W_i\) and \(Y_i\) are the treatment assignment and observed outcome for test set unit i. An estimate of the ATE for the \(k\)-th arm is given by the average of these scores: \(\frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \Gamma_{i,k}\). An IPW-based estimate of \(Q(B)\) is going to be an average of these scores that “matches” the underlying policy prescription \(\pi_B\).
The maq package has a simple convenience utility get_ipw_scores that can be used to construct these via IPW (which by default assumes the arms have uniform assignment probabilities \(\frac{1}{K+1}\)).
Note: if the randomization probabilities are not known (as could be the case in an observational setting), then a more robust alternative to plugging propensity estimates into the expression above is to use augmented inverse-propensity weighting (AIPW), yielding a doubly robust estimate of the Qini curve. This approach is not covered here; for details we refer to the paper.
[5]:
# Construct an n_test*K matrix of evaluation scores
IPW_scores = get_ipw_scores(Y[test], W[test])
# Predict CATEs on the test set
tau_hat = tau_function.predict(X[test, :])
# Specify our cost structure,
# assume the cost of assigning each unit the first arm is 0.2
# and the cost of assigning each unit the second more effective arm is 0.5
# (these can also be array-valued if costs vary by unit)
cost = [0.2, 0.5]
[6]:
# Start by fitting a baseline Qini curve that only considers the average treatment effects and costs
qini_baseline = MAQ(target_with_covariates=False, n_bootstrap=200)
qini_baseline.fit(tau_hat, cost, IPW_scores)
qini_baseline.plot()
This curve has a kink at \(B=0.2\): the first segment traces out the ATE of the lower cost arm, and the second segment the ATE of the higher cost but on average more effective arm. Points on this curve represent the average benefit per unit when targeting an arbitrary group of units.
For example, at an average spend of 0.2 our gain (along with standard errors) is equal to the arm 1 ATE of
The Qini curve for this arm plateaus at a spend of around
[10]:
print(qini_1.path_spend_[-1])
0.168840000000009
which means that at this spend level we have given treatment to all units predicted to benefit from treatment (that is, tau_hat_1 is > 0). We can read off estimates and std errors from the curve, for example at a spend of \(B=0.1\) per unit, the estimated average treatment effect per unit is
[11]:
qini_1.average_gain(0.1)
[11]:
(0.36935574662263526, 0.037401976389534526)
(these standard errors are conditional on the estimated function \(\hat \tau(\cdot)\) and quantify test set uncertainty in estimating the Qini curve).
We can assess the value of targeting with arm 1 at various spend levels by estimating the vertical difference between the blue and black line. Let’s call the Qini curve for arm 1 \(Q_1(B)\) and the Qini curve for the baseline policy \(\overline Q(B)\). At \(B=0.1\), an estimate of \(Q_1(0.1) - \overline Q(0.1)\) is
[12]:
est, std_err = qini_1.difference_gain(qini_baseline, 0.1)
est, std_err
[12]:
(0.12102711982945671, 0.024068445449392815)
That is, at a budget of 0.1 per unit a 95% confidence interval for the increase in gain when targeting with the given arm 1 CATE function over random targeting is
[13]:
[est - 1.96*std_err, est + 1.96*std_err]
[13]:
[0.0738529667486468, 0.16820127291026662]
(points on arbitrary curves can be compared with the difference_gain method, yielding paired standard errors that account for the correlation between Qini curves fit on the same test data).
Similarly, we can estimate a Qini curve \(Q_2(B)\) for the second costlier arm.
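A sketch of how this arm-specific curve (and the multi-armed curve qini_ma used below) might be fit; this assumes MAQ.fit accepts single-column reward/score matrices and a one-element cost list for the single-arm case, so the exact argument shapes may differ from the maq API:

# Qini curve for arm 2 only (single-column CATEs, scores, and cost)
qini_2 = MAQ(n_bootstrap=200)
qini_2.fit(tau_hat[:, [1]], [cost[1]], IPW_scores[:, [1]])

# Multi-armed Qini curve using both arms jointly
qini_ma = MAQ(n_bootstrap=200)
qini_ma.fit(tau_hat, cost, IPW_scores)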
Qini curves for single-armed treatment rules allow for assessing the value of targeting with a specific arm or targeting function. The generalization of the Qini to multiple treatment arms allows us to also assess the value of targeting with a combination of arms.
At \(B=0.3\), the estimated increase in gain when targeting with both arms over using only the second arm, \(Q(0.3) - Q_2(0.3)\), is
[17]:
qini_ma.difference_gain(qini_2, 0.3)
[17]:
(0.17003364056661086, 0.036733311977033105)
In this example, a multi-armed policy achieves a larger gain by assigning the treatment that is most cost-beneficial to each test set unit. The underlying policy \(\pi_B\) looks like
where each row corresponds to \(\pi_B(X_i)\), and the \(k\)-th column contains a 1 if it is optimal to assign that arm to the \(i\)-th unit at the given spend (all entries are 0 if the control arm is assigned). An alternative representation of the policy is to take values in the treatment arm set {0, 1, 2}
[19]:
qini_ma.predict(0.3, prediction_type="vector")
[19]:
array([2, 2, 1, ..., 1, 2, 2])
In addition to comparing points on different Qini curves, we can also compare across a range of spend levels by estimating an area between two curves up to a maximum \(\overline B\). An estimate and standard error of the area between the green and red curves up to \(\overline B=0.5\), the integral \(\int_{0}^{0.5} (Q(B) - Q_2(B))dB\), is
In this section we dive more deeply into the algorithms implemented in CausalML. To provide a basis for the discussion, we review some of the frameworks and definitions used in the literature.
We use the Neyman-Rubin potential outcomes framework and assume Y represents the outcome, W represents the treatment assignment, and X_i the observed covariates.
A meta-algorithm (or meta-learner) is a framework to estimate the Conditional Average Treatment Effect (CATE) using any machine learning estimators (called base learners) [16].
A meta-algorithm uses either a single base learner while having the treatment indicator as a feature (e.g. S-learner), or multiple base learners separately for each of the treatment and control groups (e.g. T-learner, X-learner and R-learner).
Confidence intervals of average treatment effect estimates are calculated based on the lower bound formula (7) from [14].
Including the propensity score in the model can reduce bias from regularization induced confounding [30].
When the control and treatment groups are very different in covariates, a single linear model is not sufficient to encode the different relevant dimensions and smoothness of features for the control and treatment groups [1].
Impute the user level treatment effects, \(D^1_i\) and \(D^0_j\), for user \(i\) in the treatment group based on \(\mu_0(x)\) and for user \(j\) in the control group based on \(\mu_1(x)\):
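\[D^1_i = Y^1_i - \hat{\mu}_0(X^1_i), \qquad D^0_j = \hat{\mu}_1(X^0_j) - Y^0_j\]
(the standard X-learner imputation, where \(\hat{\mu}_0(x)\) and \(\hat{\mu}_1(x)\) are the first-stage outcome models fit on the control and treatment groups, respectively).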
R-learner [19] uses the cross-validation out-of-fold estimates of outcomes \(\hat{m}^{(-i)}(x_i)\) and propensity scores \(\hat{e}^{(-i)}(x_i)\). It consists of two stages as follows:
Stage 1
Fit \(\hat{m}(x)\) and \(\hat{e}(x)\) with machine learning models using cross-validation.
Stage 2
Estimate treatment effects by minimising the R-loss, \(\hat{L}_n(\tau(x))\):
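\[\hat{L}_n(\tau(x)) = \frac{1}{n}\sum_{i=1}^{n}\left[\left(Y_i - \hat{m}^{(-i)}(X_i)\right) - \left(W_i - \hat{e}^{(-i)}(X_i)\right)\,\tau(X_i)\right]^2\]
(the standard R-loss objective, shown here without its regularization term).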
DR-learner [15] estimates the CATE via cross-fitting a doubly-robust score function in two stages as follows. We start by randomly splitting the data \(\{Y, X, W\}\) into 3 partitions \(\{Y^i, X^i, W^i\}, i=\{1,2,3\}\).
Stage 1
Fit a propensity score model \(\hat{e}(x)\) with machine learning using \(\{X^1, W^1\}\), and fit outcome regression models \(\hat{m}_0(x)\) and \(\hat{m}_1(x)\) for treated and untreated users with machine learning using \(\{Y^2, X^2, W^2\}\).
Stage 2
Use machine learning to fit the CATE model, \(\hat{\tau}(X)\) from the pseudo-outcome
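\[\hat{\psi}_i = \frac{W_i - \hat{e}(X_i)}{\hat{e}(X_i)\left(1 - \hat{e}(X_i)\right)}\left(Y_i - \hat{m}_{W_i}(X_i)\right) + \hat{m}_1(X_i) - \hat{m}_0(X_i),\]
i.e. the standard doubly robust (AIPW) pseudo-outcome, computed here on the remaining partition \(\{Y^3, X^3, W^3\}\).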
Repeat Stage 1 and Stage 2 two more times with the partitions rotated. First use \(\{Y^2, X^2, W^2\}\), \(\{Y^3, X^3, W^3\}\), and \(\{Y^1, X^1, W^1\}\) for the propensity score model, the outcome models, and the CATE model, respectively. Then use \(\{Y^3, X^3, W^3\}\), \(\{Y^1, X^1, W^1\}\), and \(\{Y^2, X^2, W^2\}\) for the propensity score model, the outcome models, and the CATE model. The final CATE model is the average of the 3 CATE models.
We combine the idea from DR-learner [15] with the doubly robust score function for LATE described in [10] to estimate the conditional LATE. Towards that end, we start by randomly splitting the data \(\{Y, X, W, Z\}\) into 3 partitions \(\{Y^i, X^i, W^i, Z^i\}, i=\{1,2,3\}\).
Stage 1
Fit propensity score models \(\hat{e}_0(x)\) and \(\hat{e}_1(x)\) for assigned and unassigned users using \(\{X^1, W^1, Z^1\}\), and fit outcome regression models \(\hat{m}_0(x)\) and \(\hat{m}_1(x)\) for assigned and unassigned users with machine learning using \(\{Y^2, X^2, Z^2\}\). The assignment probability, \(p_Z\), can either be user provided or come from a simple model, since in most use cases assignment is random by design.
Stage 2
Use machine learning to fit the conditional LATE model, \(\hat{\tau}(X)\) by minimizing the following loss function
Similar to the DR-learner, repeat Stage 1 and Stage 2 two more times with different permutations of partitions for estimation. The final conditional LATE model is the average of the 3 conditional LATE models.
The Uplift Tree approach consists of a set of methods that use a tree-based algorithm where the splitting criterion is based on differences in uplift. [22] proposed three different ways to quantify the gain in divergence as the result of splitting [11]:
where \(D\) measures the divergence and \(P^T\) and \(P^C\) refer to the probability distribution of the outcome of interest in the treatment and control groups, respectively. Three different ways to quantify the divergence, KL, ED and Chi, are implemented in the package.
where \(p\) is the sample mean in the treatment group, \(q\) is the sample mean in the control group and \(k\) indicates the leaf in which \(p\) and \(q\) are computed [11]
Another Uplift Tree algorithm that is implemented is the delta-delta-p (\(\Delta\Delta P\)) approach by [9], where the sample splitting criterion is defined as follows:
where \(a_0\) and \(a_1\) are the outcomes of a Split A, \(y\) is the selected class, and \(P^T\) and \(P^C\) are the response rates of treatment and control group, respectively. In other words, we first calculate the difference in the response rate in each branch (\(\Delta P_{left}\) and \(\Delta P_{right}\)), and subsequently, calculate their differences (\(\Delta\Delta P = |\Delta P_{left} - \Delta P_{right}|\)).
where the entropy \(H\) is defined as \(H(p,q) = -p \log_2 p - q \log_2 q\) and where \(\phi\) is a subset of the feature space
associated with the current decision node, and \(\phi_l\) and \(\phi_r\) are the left and right child nodes, respectively.
\(n_t(\phi)\) is the number of treatment samples, \(n_c(\phi)\) the number of control samples, and \(n(\phi)\) the number
of all samples in the current (parent) node.
Further, the package implements the Interaction Tree (IT) proposed by [26], where the sample splitting criterion
maximizes the G statistic among all permissible splits:
where \(\sigma=\sum_{i=1}^4 w_i s_i^2\) is a pooled estimator of the constant variance, and \(w_i=(n_i-1)/\sum_{j=1}^4(n_j-1)\).
Further, \(y^L_1\), \(s^2_1\), and \(n_1\) are the sample mean, the sample variance, and the sample size
for the treatment group in the left child node, respectively. Similar notation applies to the other quantities.
Note that this implementation deviates from the original implementation in that (1) the pruning techniques and (2) the validation method
for determining the best tree size are different.
Also, the package implements the Causal Inference Tree (CIT) by [25], where the sample splitting
criterion calculates the likelihood ratio test statistic:
where \(n_{\tau}\), \(n_{\tau 0}\), and \(n_{\tau 1}\) are the total number of observations in node \(\tau\),
the number of observations in node \(\tau\) that are assigned to the control group, and the number of observations in node \(\tau\)
that are assigned to the treatment group, respectively. \(SSE_{\tau}\) is defined as:
and \(\hat{y_{t0}}\) and \(\hat{y_{t1}}\) are the sample average responses of the control and treatment groups in node
\(\tau\), respectively.
Note that this implementation deviates from the original implementation in that (1) the pruning techniques and (2) the validation method
for determining the best tree size are different.
The final Uplift Tree algorithm that is implemented is the Contextual Treatment Selection (CTS) approach by [28], where the sample splitting criterion is defined as follows:
where \(\phi_l\) and \(\phi_r\) refer to the feature subspaces in the left leaf and the right leaves respectively, \(\hat{p}(\phi_j \mid \phi)\) denotes the estimated conditional probability of a subject’s being in \(\phi_j\) given \(\phi\), and \(\hat{y}_t(\phi_j)\) is the conditional expected response under treatment \(t\).
The package supports methods for assigning treatment groups when treatments are costly. To understand the problem, it is helpful to divide populations into the following four categories:
Compliers. Those who will have a favourable outcome if and only if they are treated.
Always-takers. Those who will have a favourable outcome whether or not they are treated.
Never-takers. Those who will never have a favourable outcome whether or not they are treated.
Defiers. Those who will have a favourable outcome if and only if they are not treated.
[18] propose a method for selecting units for treatments using counterfactual logic. Suppose the following benefits for selecting units belonging to the different categories above:
Compliers: \(\beta\)
Always-takers: \(\gamma\)
Never-takers: \(\theta\)
Defiers: \(\delta\)
If \(X\) denotes the set of individual’s features, the unit selection problem can be formulated as follows:
The problem can be reformulated using counterfactual logic. Suppose \(W = w\) indicates that an individual is treated and \(W = w'\) indicates he or she is untreated. Similarly, let \(F = f\) denote a favourable outcome for the individual and \(F = f'\) an unfavourable outcome. Then the optimization problem becomes:
Note that the above simply follows from the definitions of the relevant user segments. [18] then use counterfactual logic ([21]) to solve the above optimization problem under certain conditions.
N.B. The current implementation in the package is highly experimental.
The counterfactual value estimation method implemented in the package predicts the outcome for a unit under different treatment conditions using a standard machine learning model. The expected value of assigning a unit into a particular treatment is then given by
\[\mathbb{E}[(v - cc_w)Y_w - ic_w]\]
where \(Y_w\) is the probability of a favourable event (such as conversion) under a given treatment \(w\), \(v\) is the value of the favourable event, \(cc_w\) is the cost of the treatment triggered in case of a favourable event, and \(ic_w\) is the cost associated with the treatment whether or not the outcome is favourable. This method builds upon the ideas discussed in [29].
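For example, with hypothetical values \(v = 50\), \(cc_w = 10\), \(ic_w = 2\), and \(Y_w = 0.1\), the expected value of assigning the unit to treatment \(w\) is \((50 - 10) \times 0.1 - 2 = 2\).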
A cause is said to be necessary for an outcome if the outcome would not have occurred in the absence of the cause. A cause is said to be sufficient for an outcome if the outcome would have occurred in the presence of the cause. A cause is said to be necessary and sufficient if both of the above two conditions hold. [27] show that we can calculate bounds for the probability that a cause is of each of the above three types.
To understand how the bounds for the probabilities of causation are calculated, we need special notation to represent counterfactual quantities. Let \(y_t\) represent the proposition “\(y\) would occur if the treatment group was set to ‘treatment’”, \(y^{\prime}_c\) represent the proposition “\(y\) would not occur if the treatment group was set to ‘control’”, and similarly for the remaining two combinations of the (by assumption) binary outcome and treatment variables.
Then the probability that the treatment is sufficient for \(y\) to occur can be defined as
\[PS = P(y_t \mid c, y^{\prime})\]
This is the probability that \(y\) would occur if the treatment was set to \(t\), when in fact the treatment was set to control and the outcome did not occur.
The probability that the treatment is necessary for \(y\) to occur can be defined as
\[PN = P(y^{\prime}_c \mid t, y)\]
This is the probability that \(y\) would not occur if the treatment was set to control, while in actuality both \(y\) occurs and the treatment takes place.
Finally, the probability that the treatment is both necessary and sufficient is defined as
\[PNS = P(y_t, y^{\prime}_c)\]
and states that \(y\) would occur if the treatment took place; and \(y\) would not occur if the treatment did not take place. PNS is related with PN and PS as follows:
\[PNS = P(t, y)PN + P(c, y^{\prime})PS\]
In bounding the above three quantities, we utilize observational data in addition to experimental data. The observational data is characterized in terms of the joint probabilities:
Given this, [27] use the program developed in [8] to obtain sharp bounds of the above three quantities. The main idea in this program is to turn the bounding task into a linear programming problem (for a modern implementation of their approach see here).
Using the linear programming approach and given certain constraints together with observational data, [27] find that the sharp lower bound for PNS is given by
They use a similar routine to find the bounds for PS and PN. The get_pns_bounds() function calculates the bounds for each of the three probabilities of causation using the results in [27].
The package supports selected traditional causal inference methods. These are usually used to conduct causal inference with observational (non-experimental) data. In these types of studies, the observed difference between the treatment and the control is in general not equal to the difference between “potential outcomes” \(\mathbb{E}[Y(1) - Y(0)]\). Thus, the methods below try to deal with this problem in different ways.
The general idea in matching is to find treated and non-treated units that are as similar as possible in terms of their relevant characteristics. As such, matching methods can be seen as part of the family of causal inference approaches that try to mimic randomized controlled trials.
While there are a number of different ways to match treated and non-treated units, the most common method is to use the propensity score:
\[e_i(X_i) = P(W_i = 1 \mid X_i)\]
Treated and non-treated units are then matched in terms of \(e(X)\) using some criterion of distance, such as \(k:1\) nearest neighbours. Because matching is usually between the treated population and the control, this method estimates the average treatment effect on the treated (ATT):
\[\mathbb{E}[Y(1) \mid W = 1] - \mathbb{E}[Y(0) \mid W = 1]\]
See [24] for a discussion of the strengths and weaknesses of the different matching methods.
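As an illustration, a minimal, self-contained sketch of 1:1 nearest-neighbour matching on an estimated propensity score and the resulting ATT estimate (this is a scikit-learn based sketch for exposition, not CausalML's matching API):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def att_by_propensity_matching(X, w, y):
    # Estimate propensity scores e(X) = P(W=1 | X)
    e = LogisticRegression(max_iter=1000).fit(X, w).predict_proba(X)[:, 1]
    treated = np.where(w == 1)[0]
    control = np.where(w == 0)[0]
    # For each treated unit, find the control unit with the closest propensity score
    nn = NearestNeighbors(n_neighbors=1).fit(e[control].reshape(-1, 1))
    _, idx = nn.kneighbors(e[treated].reshape(-1, 1))
    matched_controls = control[idx.ravel()]
    # ATT: mean outcome difference between treated units and their matched controls
    return np.mean(y[treated] - y[matched_controls])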
The inverse probability of treatment weighting (IPTW) approach uses the propensity score \(e\) to weigh the treated and non-treated populations by the inverse of the probability of the actual treatment \(W\). For a binary treatment \(W \in \{1, 0\}\):
\[\frac{W}{e} + \frac{1 - W}{1 - e}\]
In this way, the IPTW approach can be seen as creating an artificial population in which the treated and non-treated units are similar in terms of their observed features \(X\).
One of the possible benefits of IPTW compared to matching is that less data may be discarded due to lack of overlap between treated and non-treated units. A known problem with the approach is that extreme propensity scores can generate highly variable estimators. Different methods have been proposed for trimming and normalizing the IPT weights ([13]). An overview of the IPTW approach can be found in [7].
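As an illustration, a minimal sketch of a normalized IPTW estimate of the ATE using the weights defined above, assuming propensity scores e have already been estimated (trimming bounds are illustrative):

import numpy as np

def ate_iptw(w, y, e):
    # Weight each unit by the inverse probability of its observed treatment
    e = np.clip(e, 0.01, 0.99)  # simple trimming to avoid extreme weights
    treated_mean = np.sum(w * y / e) / np.sum(w / e)
    control_mean = np.sum((1 - w) * y / (1 - e)) / np.sum((1 - w) / (1 - e))
    return treated_mean - control_mean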
One of the basic requirements for identifying the treatment effect of \(W\) on \(Y\) is that \(W\) is orthogonal to the potential outcome of \(Y\), conditional on the covariates \(X\). This may be violated if both \(W\) and \(Y\) are affected by an unobserved variable, the error term after removing the true effect of \(W\) from \(Y\), that is not in \(X\). In this case, the instrumental variables approach attempts to estimate the effect of \(W\) on \(Y\) with the help of a third variable \(Z\) that is correlated with \(W\) but is uncorrelated with the error term. In other words, the instrument \(Z\) is only related with \(Y\) through the directed path that goes through \(W\). If these conditions are satisfied, in the case without covariates, the effect of \(W\) on \(Y\) can be estimated using the sample analog of:
\[\frac{Cov(Y_i, Z_i)}{Cov(W_i, Z_i)}\]
The most common method for instrumental variables estimation is the two-stage least squares (2SLS). In this approach, the cause variable \(W\) is first regressed on the instrument \(Z\). Then, in the second stage, the outcome of interest \(Y\) is regressed on the predicted value from the first-stage model. Intuitively, the effect of \(W\) on \(Y\) is estimated by using only the proportion of variation in \(W\) due to variation in \(Z\). Specifically, assume that we have the linear model
\[Y = W \alpha + X \beta + u = \Xi \gamma + u\]
Here for convenience we let \(\Xi=[W, X]\) and \(\gamma=[\alpha', \beta']'\). Assume that we have instrumental variables \(Z\) whose number of columns is at least the number of columns of \(W\), and let \(\Omega=[Z, X]\); the 2SLS estimator is as follows
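\[\hat{\gamma}_{2SLS} = \left(\Xi'\Omega(\Omega'\Omega)^{-1}\Omega'\Xi\right)^{-1}\Xi'\Omega(\Omega'\Omega)^{-1}\Omega'Y,\]
i.e. the coefficient from regressing \(Y\) on \(\Xi\) after projecting \(\Xi\) onto the column space of \(\Omega\) (the standard 2SLS form).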
In many situations the treatment \(W\) may depend on a subject's own choice and cannot be administered directly in an experimental setting. However, one can randomly assign users into treatment/control groups so that users in the treatment group can be nudged to take the treatment. This is the case of noncompliance, where users may fail to comply with their assignment status, \(Z\), as to whether or not to take the treatment. Similar to the section on Value optimization methods, in general there are 3 types of users in this situation:
Compliers. Those who will take the treatment if and only if they are assigned to the treatment group.
Always-Takers. Those who will take the treatment regardless of which group they are assigned to.
Never-Takers. Those who will not take the treatment regardless of which group they are assigned to.
For identification purposes, one assumes that there are no Defiers, i.e. those who will take the treatment only if they are assigned to the control group.
In this case one can measure the treatment effect of Compliers,
This is the Local Average Treatment Effect (LATE). The estimator is also equivalent to 2SLS if we take the assignment status, \(Z\), as an instrument.
Targeted maximum likelihood estimation (TMLE) for ATE
Targeted maximum likelihood estimation (TMLE) [17] provides a doubly robust semiparametric method that “targets” directly on the average treatment effect with the aid of machine learning algorithms. Compared to other methods, including outcome regression and inverse probability of treatment weighting, TMLE usually gives better performance, especially when dealing with skewed treatment and outliers.
Given binary treatment \(W\), covariates \(X\), and outcome \(Y\), the TMLE for ATE is performed in the following steps
Step 1
Use cross-fitting to estimate the propensity score \(\hat{e}(x)\), the predicted outcome for the treated \(\hat{m}_1(x)\), and the predicted outcome for the control \(\hat{m}_0(x)\) with machine learning.
Step 2
Scale \(Y\) into \(\tilde{Y}=\frac{Y-\min Y}{\max Y - \min Y}\) so that \(\tilde{Y} \in [0,1]\). Use the same scale function to transform \(\hat{m}_i(x)\) into \(\tilde{m}_i(x)\), \(i=0,1\). Clip the scaled functions so that their values stay in the unit interval.
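A small sketch of this scaling step, assuming y, m0_hat, and m1_hat are numpy arrays holding the outcome and the Step 1 predictions (the clipping bounds used here are illustrative):

import numpy as np

y_min, y_max = y.min(), y.max()

def scale(v):
    # Map values onto the same [0, 1] scale used for the outcome
    return (v - y_min) / (y_max - y_min)

y_tilde = scale(y)                                # scaled outcome in [0, 1]
m0_tilde = np.clip(scale(m0_hat), 0.005, 0.995)   # clipped scaled prediction for control
m1_tilde = np.clip(scale(m1_hat), 0.005, 0.995)   # clipped scaled prediction for treated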
Step 3
Let \(Q=\log(\tilde{m}_W(X)/(1-\tilde{m}_W(X)))\). Maximize the following pseudo log-likelihood function
from causalml.inference.meta import BaseSRegressor, BaseTRegressor, BaseXRegressor, BaseRRegressor

slearner = BaseSRegressor(LGBMRegressor(), control_name='control')
slearner.estimate_ate(X, w_multi, y)
slearner_tau = slearner.fit_predict(X, w_multi, y)

model_tau_feature = RandomForestRegressor()  # specify model for model_tau_feature

slearner.get_importance(X=X, tau=slearner_tau, model_tau_feature=model_tau_feature,
                        normalize=True, method='auto', features=feature_names)

# Using the feature_importances_ method in the base learner (LGBMRegressor() in this example)
slearner.plot_importance(X=X, tau=slearner_tau, normalize=True, method='auto')

# Using eli5's PermutationImportance
slearner.plot_importance(X=X, tau=slearner_tau, normalize=True, method='permutation')

# Using SHAP
shap_slearner = slearner.get_shap_values(X=X, tau=slearner_tau)

# Plot shap values without specifying shap_dict
slearner.plot_shap_values(X=X, tau=slearner_tau)

# Plot shap values WITH specifying shap_dict
slearner.plot_shap_values(X=X, shap_dict=shap_slearner)

# interaction_idx set to 'auto' (searches for feature with greatest approximate interaction)
slearner.plot_shap_dependence(treatment_group='treatment_A',
                              feature_idx=1,
                              X=X,
                              tau=slearner_tau,
                              interaction_idx='auto')
feature_name > threshold: For a non-leaf node, the first line is an inequality indicating the splitting rule of this node for its children nodes.
impurity: the impurity is defined as the value of the split criterion function (such as KL, Chi, or ED) evaluated at this current node
total_sample: sample size in this node.
group_sample: sample sizes by treatment groups
uplift score: treatment effect in this node; if there are multiple treatments, it indicates the maximum (signed) treatment effect across all treatment vs control pairs.
uplift p_value: p value of the treatment effect in this node
validation uplift score: all the information above is static once the tree is trained, while the validation uplift score represents the treatment effect computed on validation or testing data filled into the trained tree with the fill() method. This score can be compared with the training uplift score to evaluate whether the tree has an overfitting issue.
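For reference, a typical way to render such a tree with CausalML's plotting utility (assuming a fitted UpliftTreeClassifier named uplift_model and a list of feature names x_names):

from IPython.display import Image
from causalml.inference.tree import uplift_tree_plot

# Render the fitted tree; each node displays the statistics described above
graph = uplift_tree_plot(uplift_model.fitted_uplift_tree, x_names)
Image(graph.create_png())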
Estimation of the treatment effect cannot be validated the same way as regular ML predictions because the true value is not available except for the experimental data. Here we focus on the internal validation methods under the assumption of unconfoundedness of potential outcomes and the treatment status conditioned on the feature set available to us.
We can validate the methodology by comparing the estimates with other approaches, checking the consistency of estimates across different levels and cohorts.
In meta-algorithms we can assess the quality of user-level treatment effect estimation by comparing estimates from different underlying ML algorithms. We report the MSE, coverage (overlapping 95% confidence intervals), and the uplift curve. In addition, we can split the sample within a cohort and compare the results from out-of-sample scoring and within-sample scoring.
User Level/Segment Level/Cohort Level Consistency
We can also evaluate user-level, segment-level, and cohort-level estimation consistency by conducting t-tests.
Treatment effects may vary from cohort to cohort but should not be too volatile. For a given cohort, we compare the scores generated by a model fit on other cohorts with the scores generated by its own model.
We can test the methodology with simulations, where we generate data with known causal and non-causal links between the outcome, the treatment, and some confounding variables.
We implemented the following sets of synthetic data generation mechanisms based on [19]:
We can validate the estimation by evaluating and comparing the uplift gains with the AUUC (Area Under the Uplift Curve), which measures cumulative gains. Please find more details in the meta_learners_with_synthetic_data.ipynb example notebook.
from causalml.dataset import *
from causalml.metrics import *

# Single simulation
train_preds, valid_preds = get_synthetic_preds_holdout(simulate_nuisance_and_easy_treatment,
                                                       n=50000,
                                                       valid_size=0.2)

# Cumulative Gain AUUC values for a single simulation of validation data
get_synthetic_auuc(valid_preds)
Sensitivity analysis aims to check the robustness of the unconfoundedness assumption. If there is hidden bias (unobserved confounders), it determines how severe the bias would have to be to change the conclusion, by examining how the average treatment effect estimate changes.
We implemented the following methods to conduct sensitivity analysis:
Blackwell (2013) (https://www.mattblackwell.org/files/papers/sens.pdf) introduced an approach to sensitivity analysis for causal effects that directly models confounding or selection bias.
One Sided Confounding Function: here, as the name implies, this function can detect sensitivity to one-sided selection bias, but it would fail to detect other deviations from ignorability. That is, it can only determine the bias resulting from the treatment group being on average better off or the control group being on average better off.
Alignment Confounding Function: this type of bias is likely to occur when units select into treatment and control based on their predicted treatment effects
The sensitivity analysis is rigid in this way because the confounding function is not identified from the data, so the causal model in the last section is only identified conditional on a specific choice of that function. The goal of the sensitivity analysis is not to choose the “correct” confounding function, since we have no way of evaluating this correctness. By its very nature, unmeasured confounding is unmeasured. Rather, the goal is to identify plausible deviations from ignorability and test sensitivity to those deviations. The main harm that results from the incorrect specification of the confounding function is that hidden biases remain hidden.
X_train – (np.ndarray), training subsample of feature matrix, (n_train_sample, n_features)
X_test – (np.ndarray), test subsample of feature matrix, (n_test_sample, n_features)
inbag – (ndarray, optional),
The inbag matrix that fit the data. If set to None (default) it
will be inferred from the forest. However, this only works for trees
for which bootstrapping was set to True. That is, if sampling was
done with replacement. Otherwise, users need to provide their own
inbag matrix.
calibrate – (boolean, optional)
Whether to apply calibration to mitigate Monte Carlo noise.
Some variance estimates may be negative due to Monte Carlo effects if
the number of trees in the forest is too small.
Default: True
memory_constrained – (boolean, optional)
Whether or not there is a restriction on memory. If False, it is
assumed that a ndarray of shape (n_train_sample,n_test_sample) fits
in main memory. Setting to True can actually provide a speedup if
memory_limit is tuned to the optimal range.
memory_limit – (int, optional)
An upper bound for how much memory the intermediate matrices will take
up in Megabytes. This must be provided if memory_constrained=True.
Returns:
(np.ndarray), An array with the unbiased sampling variance for a RandomForest object.
A Causal Tree regressor class.
The Causal Tree is a decision tree regressor with a split criteria for treatment effects.
Details are available at Athey and Imbens (2015).
Tree node class to contain all the statistics of the tree node.
Parameters:
classes (list of str) – A list of the control and treatment group names.
col (int, optional (default = -1)) – The column index for splitting the tree node to children nodes.
value (float, optional (default = None)) – The value of the feature column to split the tree node to children nodes.
trueBranch (object of DecisionTree) – The true branch tree node (feature > value).
falseBranch (object of DecisionTree) – The false branch tree node (feature <= value).
results (list of float) – The classification probability P(Y=1|T) for each of the control and treatment groups
in the tree node.
summary (list of list) – Summary statistics of the tree nodes, including impurity, sample size, uplift score, etc.
maxDiffTreatment (int) – The treatment index generating the maximum difference between the treatment and control groups.
maxDiffSign (float) – The sign of the maximum difference (1. or -1.).
nodeSummary (list of list) – Summary statistics of the tree node [P(Y=1|T), N(T)], i.e. the positive outcome probability and the sample size for each of the control and treatment groups.
backupResults (list of float) – The positive probabilities in each of the control and treatment groups in the parent node. The parent node
information is served as a backup for the children node, in case no valid statistics can be calculated from the
children node, the parent node information will be used in certain cases.
bestTreatment (int) – The treatment index providing the best uplift (treatment effect).
upliftScore (list) – The uplift score of this node: [max_Diff, p_value], where max_Diff stands for the maximum treatment effect, and
p_value stands for the p_value of the treatment effect.
matchScore (float) – The uplift score by filling a trained tree with validation dataset or testing dataset.
n_estimators (integer, optional (default=10)) – The number of trees in the uplift random forest.
evaluationFunction (string) – Choose from one of the models: ‘KL’, ‘ED’, ‘Chi’, ‘CTS’, ‘DDP’, ‘IT’, ‘CIT’, ‘IDDP’.
max_features (int, optional (default=10)) – The number of features to consider when looking for the best split.
random_state (int, RandomState instance or None (default=None)) – A random seed or np.random.RandomState to control randomness in building the trees and forest.
max_depth (int, optional (default=5)) – The maximum depth of the tree.
min_samples_leaf (int, optional (default=100)) – The minimum number of samples required to be split at a leaf node.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=10)) – The regularization parameter defined in Rzepakowski et al. 2012, the
weight (in terms of sample size) of the parent node influence on the
child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
early_stopping_eval_diff_scale (float, optional (default=1)) – If the difference between the training and validation uplift scores is larger than
min(train_uplift_score, valid_uplift_score)/early_stopping_eval_diff_scale, stop.
control_name (string) – The name of the control group (other experiment groups will be regarded as treatment groups)
normalization (boolean, optional (default=True)) – The normalization factor defined in Rzepakowski et al. 2012,
correcting for tests with large number of splits and imbalanced
treatment and control splits
honesty (bool (default=False)) – True if the honest approach based on “Athey, S., & Imbens, G. (2016). Recursive partitioning for
heterogeneous causal effects.” shall be used.
estimation_sample_size (float (default=0.5)) – Sample size for estimating the CATE score in the leaves if honesty == True.
n_jobs (int, optional (default=-1)) – The parallelization parameter to define how many parallel jobs need to be created.
This is passed on to joblib library for parallelizing uplift-tree creation and prediction.
joblib_prefer (str, optional (default="threads")) – The preferred backend for joblib (passed as prefer to joblib.Parallel). See the joblib
documentation for valid values.
Outputs:
df_res (pandas dataframe) – A user-level results dataframe containing the estimated individual treatment effect.
static bootstrap(X, treatment, y, X_val, treatment_val, y_val, tree)
fit(X, treatment, y, X_val=None, treatment_val=None, y_val=None)
Fit the UpliftRandomForestClassifier.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment (array-like, shape = [num_samples]) – An array containing the treatment group for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
X_val (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to valid the uplift model.
treatment_val (array-like, shape = [num_samples]) – An array containing the validation treatment group for each unit.
y_val (array-like, shape = [num_samples]) – An array containing the validation outcome of interest for each unit.
Returns the recommended treatment group and predicted optimal
probability conditional on using the recommended treatment group.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
full_output (bool, optional (default=False)) – Whether the UpliftTree algorithm returns upliftScores, pred_nodes
alongside the recommended treatment group and p_hat in the treatment group.
Returns:
y_pred_list (ndarray, shape = [num_samples, num_treatments]) – An ndarray containing the predicted treatment effect of each treatment group for each sample
df_res (DataFrame, shape = [num_samples, (num_treatments * 2 + 3)]) – If full_output is True, a DataFrame containing the predicted outcome of each treatment and
control group, the treatment effect of each treatment group, the treatment group with the
highest treatment effect, and the maximum treatment effect for each sample.
An uplift tree classifier estimates the individual treatment effect by modifying the loss function in the
classification trees.
The uplift tree classifier is used in uplift random forest to construct the trees in the forest.
Parameters:
evaluationFunction (string) – Choose from one of the models: ‘KL’, ‘ED’, ‘Chi’, ‘CTS’, ‘DDP’, ‘IT’, ‘CIT’, ‘IDDP’.
max_features (int, optional (default=None)) – The number of features to consider when looking for the best split.
max_depth (int, optional (default=3)) – The maximum depth of the tree.
min_samples_leaf (int, optional (default=100)) – The minimum number of samples required to be split at a leaf node.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=100)) – The regularization parameter defined in Rzepakowski et al. 2012, the weight (in terms of sample size) of the
parent node influence on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
early_stopping_eval_diff_scale (float, optional (default=1)) – If the difference between the training and validation uplift scores is larger than
min(train_uplift_score, valid_uplift_score)/early_stopping_eval_diff_scale, stop.
control_name (string) – The name of the control group (other experiment groups will be regarded as treatment groups).
normalization (boolean, optional (default=True)) – The normalization factor defined in Rzepakowski et al. 2012, correcting for tests with large number of splits
and imbalanced treatment and control splits.
honesty (bool (default=False)) – True if the honest approach based on “Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects.”
shall be used. If ‘IDDP’ is used as evaluation function, this parameter is automatically set to true.
estimation_sample_size (float (default=0.5)) – Sample size for estimating the CATE score in the leaves if honesty == True.
random_state (int, RandomState instance or None (default=None)) – A random seed or np.random.RandomState to control randomness in building a tree.
Calculate likelihood ratio test statistic as split evaluation criterion for a given node
NOTE: n_class should be 2.
Parameters:
cur_node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
cur_node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the current node, i.e. [N(T=i)…]
left_node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the left node, i.e. [P(Y=1|T=i)…]
left_node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the left node, i.e. [N(T=i)…]
right_node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the right node, i.e. [P(Y=1|T=i)…]
right_node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the right node, i.e. [N(T=i)…]
Calculate CTS (conditional treatment selection) as split evaluation criterion for a given node.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the current node, i.e. [N(T=i)…]
Calculate Chi-Square statistic as split evaluation criterion for a given node.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the current node, i.e. [N(T=i)…]
Calculate Delta P as split evaluation criterion for a given node.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the current node, i.e. [N(T=i)…]
Calculate Euclidean Distance as split evaluation criterion for a given node.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the current node, i.e. [N(T=i)…]
Calculate Delta P as split evaluation criterion for a given node.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the current node, i.e. [N(T=i)…]
Calculate Squared T-Statistic as split evaluation criterion for a given node
NOTE: n_class should be 2.
Parameters:
left_node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the left node, i.e. [P(Y=1|T=i)…]
left_node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the left node, i.e. [N(T=i)…]
right_node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the right node, i.e. [P(Y=1|T=i)…]
right_node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the right node, i.e. [N(T=i)…]
Calculate KL Divergence as split evaluation criterion for a given node.
Modified to accept new node summary format.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double. The positive probabilities of each of the control and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the current node, i.e. [N(T=i)…]
cur_node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the current node, i.e. [N(T=i)…]
left_node_summary_n (array of shape [n_class]) – Has type numpy.int32. The counts of each of the control and treatment groups of the left node, i.e. [N(T=i)…]
alpha (float) – The weight used to balance different normalization parts.
Classifies (predicts) the observations according to the tree.
Parameters:
observations (list of list) – The internal data format for the training data (combining X, Y, treatment).
dataMissing (boolean, optional (default = False)) – An indicator for if data are missing or not.
Returns:
The results in the leaf node.
Return type:
tree.results, tree.upliftScore
static divideSet(X, treatment_idx, y, column, value)
Tree node split.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group index for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
column (int) – The column used to split the data.
value (float or int) – The value in the column for splitting the data.
Returns:
(X_l, X_r, treatment_l, treatment_r, y_l, y_r) – The covariates, treatments and outcomes of left node and the right node.
Return type:
list of ndarray
static divideSet_len(X, treatment_idx, y, column, value)
Tree node split.
Modified from divideSet(), but returns len(X_l) and len(X_r) instead of
the split X_l and X_r, to avoid some overhead; intended to be used when
searching for the split. After finding the best split, divideSet() can be
used to obtain X_l and X_r.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group index for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
column (int) – The column used to split the data.
value (float or int) – The value in the column for splitting the data.
Returns:
(len_X_l, len_X_r, treatment_l, treatment_r, y_l, y_r) – The number of rows of covariates, and the treatments and outcomes of the left and right nodes.
Return type:
list of ndarray
static evaluate_CIT(currentNodeSummary, leftNodeSummary, rightNodeSummary, y_l, y_r, w_l, w_r, y, w)
Calculate likelihood ratio test statistic as split evaluation criterion for a given node
:param currentNodeSummary: The parent node summary statistics
:type currentNodeSummary: list of lists
:param leftNodeSummary: The left node summary statistics.
:type leftNodeSummary: list of lists
:param rightNodeSummary: The right node summary statistics.
:type rightNodeSummary: list of lists
:param y_l: An array containing the outcome of interest for each unit in the left node
:type y_l: array-like, shape = [num_samples]
:param y_r: An array containing the outcome of interest for each unit in the right node
:type y_r: array-like, shape = [num_samples]
:param w_l: An array containing the treatment for each unit in the left node
:type w_l: array-like, shape = [num_samples]
:param w_r: An array containing the treatment for each unit in the right node
:type w_r: array-like, shape = [num_samples]
:param y: An array containing the outcome of interest for each unit
:type y: array-like, shape = [num_samples]
:param w: An array containing the treatment for each unit
:type w: array-like, shape = [num_samples]
Fill the data into an existing tree.
This is a higher-level function to transform the original data inputs
into lower level data inputs (list of list and tree).
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment (array-like, shape = [num_samples]) – An array containing the treatment group for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group idx for each unit.
The dtype should be numpy.int8.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
X_val (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to valid the uplift model.
treatment_val_idx (array-like, shape = [num_samples]) – An array containing the validation treatment group idx for each unit.
y_val (array-like, shape = [num_samples]) – An array containing the validation outcome of interest for each unit.
max_depth (int, optional (default=10)) – The maximum depth of the tree.
min_samples_leaf (int, optional (default=100)) – The minimum number of samples required to be split at a leaf node.
depth (int, optional (default = 1)) – The current depth.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=10)) – The regularization parameter defined in Rzepakowski et al. 2012,
the weight (in terms of sample size) of the parent node influence
on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
parentNodeSummary_p (array-like, shape [n_class]) – Node summary probability statistics of the parent tree node.
Apply the honest approach based on “Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects.”
Parameters:
X_est (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to calculate the unbiased estimates in the leaves of the decision tree.
T_est (array-like, shape = [num_samples]) – An array containing the treatment group for each unit.
Y_est (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
Returns the recommended treatment group and predicted optimal
probability conditional on using the recommended treatment group.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
Returns:
pred – An ndarray of predicted treatment effects across treatments.
Return type:
ndarray, shape = [num_samples, num_treatments]
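A minimal usage sketch of fit and predict, assuming the surrounding class is UpliftTreeClassifier and using the synthetic data helper from causalml.dataset:

from causalml.dataset import make_uplift_classification
from causalml.inference.tree import UpliftTreeClassifier

df, x_names = make_uplift_classification()
clf = UpliftTreeClassifier(control_name='control')
clf.fit(df[x_names].values,
        treatment=df['treatment_group_key'].values,
        y=df['conversion'].values)
pred = clf.predict(df[x_names].values)   # shape = [num_samples, num_treatments]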
prune(X, treatment, y, minGain=0.0001, rule='maxAbsDiff')
Prune the uplift model.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment (array-like, shape = [num_samples]) – An array containing the treatment group for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
minGain (float, optional (default = 0.0001)) – The minimum gain required to make a tree node split. The children
tree branches are trimmed if the actual split gain is less than
the minimum gain.
rule (string, optional (default = 'maxAbsDiff')) – The prune rules. Supported values are ‘maxAbsDiff’ for optimizing
the maximum absolute difference, and ‘bestUplift’ for optimizing
the node-size weighted treatment effect.
Returns:
self
Return type:
object
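Continuing the fit/predict sketch above, pruning is applied in place after fitting (a hedged example with the default rule):

clf.prune(df[x_names].values,
          treatment=df['treatment_group_key'].values,
          y=df['conversion'].values,
          minGain=0.0001, rule='maxAbsDiff')
pred_pruned = clf.predict(df[x_names].values)   # predictions from the pruned tree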
pruneTree(X, treatment_idx, y, tree, rule='maxAbsDiff', minGain=0.0, n_reg=0, parentNodeSummary=None)
Prune one single tree node in the uplift model.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group index for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
rule (string, optional (default = 'maxAbsDiff')) – The prune rules. Supported values are ‘maxAbsDiff’ for optimizing the maximum absolute difference, and
‘bestUplift’ for optimizing the node-size weighted treatment effect.
minGain (float, optional (default = 0.)) – The minimum gain required to make a tree node split. The children tree branches are trimmed if the actual
split gain is less than the minimum gain.
n_reg (int, optional (default=0)) – The regularization parameter defined in Rzepakowski et al. 2012, the weight (in terms of sample size) of the
parent node influence on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
parentNodeSummary (list of list, optional (default = None)) – Node summary statistics, [P(Y=1|T), N(T)] of the parent tree node.
Returns:
self
Return type:
object
tree_node_summary(treatment_idx, y, min_samples_treatment=10, n_reg=100, parentNodeSummary=None)
Tree node summary statistics.
Parameters:
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group index for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=10)) – The regularization parameter defined in Rzepakowski et al. 2012,
the weight (in terms of sample size) of the parent node influence
on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
parentNodeSummary (list of list) – The positive probabilities and sample sizes of each of the control and treatment groups
in the parent node.
Returns:
nodeSummary – The positive probabilities and sample sizes of each of the control and treatment groups
in the current node.
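For intuition, an illustrative NumPy sketch of the summary format described above, without the n_reg shrinkage toward the parent node that the actual method applies:

import numpy as np

treatment_idx = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 1, 1, 0, 1, 0])

nodeSummary = []
for t in np.unique(treatment_idx):
    in_group = treatment_idx == t
    nodeSummary.append([float(y[in_group].mean()),   # P(Y=1 | T=t)
                        int(in_group.sum())])        # N(T=t)
# nodeSummary ~ [[P(Y=1|T=0), N(T=0)], [P(Y=1|T=1), N(T=1)], ...]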
Modified from tree_node_summary_to_arr, to use different
format for the summary and to calculate based on already
calculated group counts. Instead of [[P(Y=1|T=0), N(T=0)],
[P(Y=1|T=1), N(T=1)], …], use two arrays [N(T=i)…] and
[P(Y=1|T=i)…].
Parameters:
group_count_arr (array of shape [2*n_class]) – Has type numpy.int32.
The group counts, where entry 2*i is N(Y=0, T=i),
and entry 2*i+1 is N(Y=1, T=i).
out_summary_p (array of shape [n_class]) – Has type numpy.double.
To be filled with the positive probabilities of each of the control
and treatment groups of the current node.
out_summary_n (array of shape [n_class]) – Has type numpy.int32.
To be filled with the counts of each of the control
and treatment groups of the current node.
parentNodeSummary_p (array of shape [n_class]) – The positive probabilities of each of the control and treatment groups
in the parent node.
has_parent_summary (bool as int) – If True (non-zero), parentNodeSummary_p contains valid parent node summary probabilities.
If False (0), no parent node summary is assumed and parentNodeSummary_p is not touched.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=10)) – The regularization parameter defined in Rzepakowski et al. 2012,
the weight (in terms of sample size) of the parent node influence
on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
Return type:
No return values, but will modify out_summary_p and out_summary_n.
static tree_node_summary_to_arr(treatment_idx, y, out_summary_p, out_summary_n, buf_count_arr, parentNodeSummary_p, has_parent_summary, min_samples_treatment=10, n_reg=100)
Tree node summary statistics.
Modified from tree_node_summary, to use different format for the summary.
Instead of [[P(Y=1|T=0), N(T=0)], [P(Y=1|T=1), N(T=1)], …],
use two arrays [N(T=i)…] and [P(Y=1|T=i)…].
Parameters:
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group index for each unit.
Has type numpy.int8.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
Has type numpy.int8.
out_summary_p (array of shape [n_class]) – Has type numpy.double.
To be filled with the positive probabilities of each of the control
and treatment groups of the current node.
out_summary_n (array of shape [n_class]) – Has type numpy.int32.
To be filled with the counts of each of the control
and treatment groups of the current node.
buf_count_arr (array of shape [2*n_class]) – Has type numpy.int32.
To be used as a temporary buffer for group_uniqueCounts_to_arr.
parentNodeSummary_p (array of shape [n_class]) – The positive probabilities of each of the control and treatment groups
in the parent node.
has_parent_summary (bool as int) – If True (non-zero), parentNodeSummary_p contains valid parent node summary probabilities.
If False (0), no parent node summary is assumed and parentNodeSummary_p is not touched.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=10)) – The regularization parameter defined in Rzepakowski et al. 2012,
the weight (in terms of sample size) of the parent node influence
on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
Return type:
No return values, but will modify out_summary_p and out_summary_n.
Create a distribution plot of the tree leaf values.
Parameters:
tree (CausalTreeRegressor) – Tree object
title (str) – plot title
figsize (tuple) – figure size
fontsize (int) – title font size
estimate_ate(X, treatment, y, p=None, bootstrap_ci=False, n_bootstraps=1000, bootstrap_size=10000, seed=None, pretrain=False)[source]
Estimate the Average Treatment Effect (ATE).
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
bootstrap_ci (bool) – whether to run bootstrap for confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
seed (int) – random seed for cross-fitting
pretrain (bool) – whether a model has been fit, default False.
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
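A minimal sketch of this ATE interface, assuming the R-learner regressor with XGBoost as the base learner and synthetic data (the tuple unpacking follows the Returns description above):

from causalml.dataset import synthetic_data
from causalml.inference.meta import BaseRRegressor
from xgboost import XGBRegressor

y, X, treatment, tau, b, e = synthetic_data(mode=1, n=1000, p=5, sigma=1.0)
learner = BaseRRegressor(learner=XGBRegressor())
ate, ate_lb, ate_ub = learner.estimate_ate(X, treatment, y, p=e)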
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
seed (int) – random seed for cross-fitting
fit_predict(X, treatment, y, p=None, return_ci=False, n_bootstraps=1000, bootstrap_size=10000, return_components=False, verbose=True, seed=None)[source]
Fit the treatment effect and outcome models of the R learner and predict treatment effects.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
return_ci (bool) – whether to return confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
return_components (bool, optional) – whether to return outcomes for treatment and control separately
verbose (bool) – whether to output progress logs
seed (int) – random seed for cross-fitting
Returns:
Predictions of treatment effects. Output dim: [n_samples, n_treatment]
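A hedged continuation, reusing the learner and data from the estimate_ate sketch above, to obtain per-unit CATE estimates:

cate = learner.fit_predict(X, treatment, y, p=e)   # shape [n_samples, n_treatment]
# With bootstrap confidence intervals (n_bootstraps lowered here for speed):
cate, cate_lb, cate_ub = learner.fit_predict(X, treatment, y, p=e,
                                             return_ci=True, n_bootstraps=100)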
fit(X, treatment, y, p=None, sample_weight=None, verbose=True)[source]
Fit the treatment effect and outcome models of the R learner.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
sample_weight (np.array or pd.Series, optional) – an array of sample weights indicating the
weight of each observation for effect_learner. If None, it assumes equal weight.
verbose (bool, optional) – whether to output progress logs
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – only needed when pretrain=False, a treatment vector
y (np.array or pd.Series) – only needed when pretrain=False, an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
sample_weight (np.array or pd.Series, optional) – an array of sample weights indicating the
weight of each observation for effect_learner. If None, it assumes equal weight.
bootstrap_ci (bool) – whether to run bootstrap for confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
pretrain (bool) – whether a model has been fit, default False.
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
fit(X, treatment, y, p=None, sample_weight=None, verbose=True)[source]
Fit the treatment effect and outcome models of the R learner.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
sample_weight (np.array or pd.Series, optional) – an array of sample weights indicating the
weight of each observation for effect_learner. If None, it assumes equal weight.
verbose (bool, optional) – whether to output progress logs
fit_predict(X, treatment, y, p=None, sample_weight=None, return_ci=False, n_bootstraps=1000, bootstrap_size=10000, verbose=True)[source]
Fit the treatment effect and outcome models of the R learner and predict treatment effects.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
sample_weight (np.array or pd.Series, optional) – an array of sample weights indicating the
weight of each observation for effect_learner. If None, it assumes equal weight.
return_ci (bool) – whether to return confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
verbose (bool) – whether to output progress logs
Returns:
Predictions of treatment effects. Output dim: [n_samples, n_treatment].
A parent class for S-learner classes.
An S-learner estimates treatment effects with one machine learning model.
Details of the S-learner are available in Kunzel et al. (2018).
estimate_ate(X, treatment, y, p=None, return_ci=False, bootstrap_ci=False, n_bootstraps=1000, bootstrap_size=10000, pretrain=False)[source]
Estimate the Average Treatment Effect (ATE).
Parameters:
X (np.matrix, np.array, or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
return_ci (bool, optional) – whether to return confidence intervals
bootstrap_ci (bool) – whether to run bootstrap for confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
pretrain (bool) – whether a model has been fit, default False.
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
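A minimal sketch, assuming LRSRegressor (the linear-regression S-learner) and synthetic data:

from causalml.dataset import synthetic_data
from causalml.inference.meta import LRSRegressor

y, X, treatment, tau, b, e = synthetic_data(mode=1, n=1000, p=5)
learner = LRSRegressor()
ate, ate_lb, ate_ub = learner.estimate_ate(X, treatment, y)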
Fit the inference model.
Parameters:
X (np.matrix, np.array, or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
fit_predict(X, treatment, y, p=None, return_ci=False, n_bootstraps=1000, bootstrap_size=10000, return_components=False, verbose=True)[source]
Fit the inference model of the S learner and predict treatment effects.
Parameters:
X (np.matrix, np.array, or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
return_ci (bool, optional) – whether to return confidence intervals
n_bootstraps (int, optional) – number of bootstrap iterations
bootstrap_size (int, optional) – number of samples per bootstrap
return_components (bool, optional) – whether to return outcomes for treatment and control separately
verbose (bool, optional) – whether to output progress logs
Returns:
Predictions of treatment effects. Output dim: [n_samples, n_treatment].
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series, optional) – a treatment vector
y (np.array or pd.Series, optional) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
return_components (bool, optional) – whether to return outcomes for treatment and control separately
return_p_score (bool, optional) – whether to return propensity score
verbose (bool, optional) – whether to output progress logs
estimate_ate(X, treatment, y, p=None, bootstrap_ci=False, n_bootstraps=1000, bootstrap_size=10000, pretrain=False)[source]
Estimate the Average Treatment Effect (ATE).
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
bootstrap_ci (bool) – whether to run bootstrap for confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
pretrain (bool) – whether a model has been fit, default False.
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
fit_predict(X, treatment, y, p=None, return_ci=False, n_bootstraps=1000, bootstrap_size=10000, return_components=False, verbose=True)[source]
Fit the treatment effect and outcome models of the R learner and predict treatment effects.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
return_ci (bool) – whether to return confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
return_components (bool, optional) – whether to return outcomes for treatment and control separately
verbose (bool) – whether to output progress logs
Returns:
Predictions of treatment effects. Output dim: [n_samples, n_treatment]
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series, optional) – a treatment vector
y (np.array or pd.Series, optional) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
return_components (bool, optional) – whether to return outcomes for treatment and control separately
verbose (bool, optional) – whether to output progress logs
estimate_ate(X, treatment, y, p=None, pretrain=False)[source]
Estimate the Average Treatment Effect (ATE).
Parameters:
X (np.matrix, np.array, or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
Ref: Gruber, S., & Van Der Laan, M. J. (2009). Targeted maximum likelihood estimation: A gentle introduction.
estimate_ate(X, treatment, y, p, segment=None, return_ci=False)[source]
Estimate the Average Treatment Effect (ATE).
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict) – an array of propensity scores of float (0,1) in the single-treatment
case; or, a dictionary of treatment groups that map to propensity vectors of float (0,1)
segment (np.array, optional) – An optional segment vector of int. If given, the ATE and its CI will be
estimated for each segment.
return_ci (bool, optional) – Whether to return confidence intervals
Returns:
The ATE and its confidence interval (LB, UB) for each treatment, t and segment, s
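A hedged sketch, assuming this block documents TMLELearner from causalml.inference.meta and that propensity scores are supplied explicitly:

from causalml.dataset import synthetic_data
from causalml.inference.meta import TMLELearner
from lightgbm import LGBMRegressor

y, X, treatment, tau, b, e = synthetic_data(mode=1, n=1000, p=5)
tmle = TMLELearner(learner=LGBMRegressor())
ate, ate_lb, ate_ub = tmle.estimate_ate(X=X, treatment=treatment, y=y, p=e)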
fit(X, treatment, y, p=None, sample_weight=None, verbose=True)[source]
Fit the treatment effect and outcome models of the R learner.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
sample_weight (np.array or pd.Series, optional) – an array of sample weights indicating the
weight of each observation for effect_learner. If None, it assumes equal weight.
verbose (bool, optional) – whether to output progress logs
Estimate the Average Treatment Effect (ATE) for compliers.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
assignment (np.array or pd.Series) – an assignment vector. The assignment is the
instrumental variable that does not depend on unknown confounders. The assignment status
influences treatment in a monotonic way, i.e. one can only be more likely to take the
treatment if assigned.
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (2-tuple of np.ndarray or pd.Series or dict, optional) – The first (second) element corresponds to
unassigned (assigned) units. Each is an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of float
(0,1). If None will run ElasticNetPropensityModel() to generate the propensity scores.
pZ (np.array or pd.Series, optional) – an array of assignment probability of float (0,1); if None
will run ElasticNetPropensityModel() to generate the assignment probability score.
bootstrap_ci (bool) – whether to run bootstrap for confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
seed (int) – random seed for cross-fitting
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
fit(X, assignment, treatment, y, p=None, pZ=None, seed=None, calibrate=True)[source]
Fit the inference model.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
assignment (np.array or pd.Series) – a (0,1)-valued assignment vector. The assignment is the
instrumental variable that does not depend on unknown confounders. The assignment status
influences treatment in a monotonic way, i.e. one can only be more likely to take the
treatment if assigned.
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (2-tuple of np.ndarray or pd.Series or dict, optional) – The first (second) element corresponds to
unassigned (assigned) units. Each is an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of float
(0,1). If None will run ElasticNetPropensityModel() to generate the propensity scores.
pZ (np.array or pd.Series, optional) – an array of assignment probability of float (0,1); if None
will run ElasticNetPropensityModel() to generate the assignment probability score.
Fit the treatment effect and outcome models of the R learner and predict treatment effects.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
assignment (np.array or pd.Series) – a (0,1)-valued assignment vector. The assignment is the
instrumental variable that does not depend on unknown confounders. The assignment status
influences treatment in a monotonic way, i.e. one can only be more likely to take the
treatment if assigned.
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (2-tuple of np.ndarray or pd.Series or dict, optional) – The first (second) element corresponds to
unassigned (assigned) units. Each is an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of float
(0,1). If None will run ElasticNetPropensityModel() to generate the propensity scores.
pZ (np.array or pd.Series, optional) – an array of assignment probability of float (0,1); if None
will run ElasticNetPropensityModel() to generate the assignment probability score.
return_ci (bool) – whether to return confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
return_components (bool, optional) – whether to return outcomes for treatment and control separately
verbose (bool) – whether to output progress logs
seed (int) – random seed for cross-fitting
Returns:
Predictions of treatment effects for compliers, i.e. those individuals
who take the treatment only if they are assigned. Output dim: [n_samples, n_treatment]
If return_ci, returns CATE [n_samples, n_treatment], LB [n_samples, n_treatment],
UB [n_samples, n_treatment]
Builds a model (using X to predict estimated/actual tau), and then calculates feature importances
based on a specified method.
Currently supported methods are:
auto (calculates importance based on estimator’s default implementation of feature importance;
estimator must be tree-based)
Note: if none provided, it uses lightgbm’s LGBMRegressor as estimator, and “gain” as
importance type
permutation (calculates importance based on mean decrease in accuracy when a feature column is permuted;
estimator can be any form)
Hint: for permutation, downsample data for better performance especially if X.shape[1] is large
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
tau (np.array) – a treatment effect vector (estimated/actual)
model_tau_feature (sklearn/lightgbm/xgboost model object) – an unfitted model object
features (np.array) – list/array of feature names. If None, an enumerated list will be used
method (str) – auto, permutation
normalize (bool) – normalize by sum of importances if method=auto (defaults to True)
test_size (float/int) – if float, represents the proportion of the dataset to include in the test split.
If int, represents the absolute number of test samples (used for estimating
permutation importance)
random_state (int/RandomState instance/None) – random state used in permutation importance estimation
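A hedged sketch of this helper as exposed on the meta-learners (the data set-up and feature names are illustrative):

from causalml.dataset import synthetic_data
from causalml.inference.meta import BaseSRegressor
from lightgbm import LGBMRegressor

y, X, treatment, tau, b, e = synthetic_data(mode=1, n=2000, p=5)
learner = BaseSRegressor(learner=LGBMRegressor())
cate = learner.fit_predict(X, treatment, y)
feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']
importance = learner.get_importance(X=X, tau=cate, features=feature_names,
                                    method='permutation', random_state=42)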
Builds a model (using X to predict estimated/actual tau), and then calculates shapley values.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
tau (np.array) – a treatment effect vector (estimated/actual)
model_tau_feature (sklearn/lightgbm/xgboost model object) – an unfitted model object
features (optional, np.array) – list/array of feature names. If None, an enumerated list will be used.
Builds a model (using X to predict estimated/actual tau), and then plots feature importances
based on a specified method.
Currently supported methods are:
auto (calculates importance based on estimator’s default implementation of feature importance;
estimator must be tree-based)
Note: if none provided, it uses lightgbm’s LGBMRegressor as estimator, and “gain” as
importance type
permutation (calculates importance based on mean decrease in accuracy when a feature column is permuted;
estimator can be any form)
Hint: for permutation, downsample data for better performance especially if X.shape[1] is large
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
tau (np.array) – a treatment effect vector (estimated/actual)
model_tau_feature (sklearn/lightgbm/xgboost model object) – an unfitted model object
features (optional, np.array) – list/array of feature names. If None, an enumerated list will be used
method (str) – auto, permutation
normalize (bool) – normalize by sum of importances if method=auto (defaults to True)
test_size (float/int) – if float, represents the proportion of the dataset to include in the test split.
If int, represents the absolute number of test samples (used for estimating
permutation importance)
random_state (int/RandomState instance/None) – random state used in permutation importance estimation
Plots dependency of shapley values for a specified feature, colored by an interaction feature.
If shapley values have been pre-computed, pass it through the shap_dict parameter.
If shap_dict is not provided, this builds a new model (using X to predict estimated/actual tau),
and then calculates shapley values.
This plots the value of the feature on the x-axis and the SHAP value of the same feature
on the y-axis. This shows how the model depends on the given feature, and is like a
richer extension of the classical partial dependence plots. Vertical dispersion of the
data points represents interaction effects.
Parameters:
treatment_group (str or int) – name of treatment group to create dependency plot on
feature_idx (str or int) – feature index / name to create dependency plot on
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
tau (np.array) – a treatment effect vector (estimated/actual)
model_tau_feature (sklearn/lightgbm/xgboost model object) – an unfitted model object
features (optional, np.array) – list/array of feature names. If None, an enumerated list will be used.
shap_dict (optional, dict) – a dict of shapley value matrices. If None, shap_dict will be computed.
interaction_idx (optional, str or int) – feature index / name used in coloring scheme as interaction feature.
If “auto” then shap.common.approximate_interactions is used to pick what seems to be the
strongest interaction (note that to find to true strongest interaction you need to compute
the SHAP interaction values).
If shapley values have been pre-computed, pass it through the shap_dict parameter.
If shap_dict is not provided, this builds a new model (using X to predict estimated/actual tau),
and then calculates shapley values.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix. Required if shap_dict is None.
tau (np.array) – a treatment effect vector (estimated/actual)
model_tau_feature (sklearn/lightgbm/xgboost model object) – an unfitted model object
features (optional, np.array) – list/array of feature names. If None, an enumerated list will be used.
shap_dict (optional, dict) – a dict of shapley value matrices. If None, shap_dict will be computed.
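A hedged continuation of the importance sketch above, using the SHAP helpers on the fitted meta-learner:

shap_dict = learner.get_shap_values(X=X, tau=cate)
learner.plot_shap_values(X=X, tau=cate, shap_dict=shap_dict)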
The organic conversion rate in the population without an intervention.
If None, the organic conversion rate is obtained from the control group.
NB: The organic conversion in the control group is not always the same
as the organic conversion rate without treatment.
data (DataFrame) – A pandas DataFrame containing the features, treatment assignment
indicator and the outcome of interest.
treatment (string) – A string corresponding to the name of the treatment column. The
assumed coding in the column is 1 for treatment and 0 for control.
outcome (string) – A string corresponding to the name of the outcome column. The assumed
coding in the column is 1 for conversion and 0 for no conversion.
treatment (array, shape = (num_samples, )) – An array of treatment group indicator values.
control_name (string) – The name of the control condition as a string. Must be contained in the treatment array.
treatment_names (list, length = cate.shape[1]) – A list of treatment group names. NB: The order of the items in the
list must correspond to the order in which the conditional average
treatment effect estimates are in cate_array.
y_proba (array, shape = (num_samples, )) – The predicted probability of conversion using the Y ~ X model across
the total sample.
cate (array, shape = (num_samples, len(set(treatment)))) – Conditional average treatment effect estimations from any model.
value (array, shape = (num_samples, )) – Value of converting each unit.
conversion_cost (shape = (num_samples, len(set(treatment)))) – The cost of a treatment that is triggered if a unit converts after having been in the treatment, such as a
promotion code.
impression_cost (shape = (num_samples, len(set(treatment)))) – The cost of a treatment that is the same for each unit whether or not they convert, such as a cost associated
with a promotion channel.
Notes
Because we get the conditional average treatment effects from
cate-learners relative to the control condition, we subtract the
cate for the unit in their actual treatment group from y_proba for that
unit, in order to recover the control outcome. We then add the cates
to the control outcome to obtain y_proba under each condition. These
outcomes are counterfactual because just one of them is actually
observed.
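A small NumPy sketch of the bookkeeping described in the Notes; the numbers are illustrative and column 0 of cate stands for the control condition:

import numpy as np

y_proba = np.array([0.30, 0.45, 0.20])                        # P(conversion) from the Y ~ X model
cate = np.array([[0.00, 0.05], [0.00, 0.02], [0.00, 0.10]])   # CATE per condition vs. control
observed = np.array([0, 1, 1])                                # observed condition index per unit

control_outcome = y_proba - cate[np.arange(len(y_proba)), observed]  # subtract each unit's own CATE
counterfactual_proba = control_outcome[:, None] + cate               # y_proba under every condition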
To capture the counterfactual notation, we use 1 and 0 to indicate the actual and
counterfactual values of a variable, respectively, and we use do to indicate the effect
of an intervention.
The experimental and observational data are assumed to come either from the same population
or from random samples of it. If the data are from a sample, the bounds may
be incorrectly calculated because the relevant quantities in the Tian-Pearl equations are
defined e.g. as \(P(Y|do(T))\), not \(P(Y|do(T), S)\) where \(S\) corresponds to sample selection.
Bareinboim and Pearl (2016) discuss conditions
under which \(P(Y|do(T))\) can be recovered from \(P(Y|do(T), S)\).
Get auuc values for cumulative gains of model estimates in quantiles.
For details, reference get_cumgain() and plot_gain()
Parameters:
synthetic_preds (dict) – dictionary of predictions generated by get_synthetic_preds() or get_synthetic_preds_holdout()
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
plot (boolean, optional) – whether to plot the cumulative gain chart or not
Returns:
auuc values by learner for cumulative gains of model estimates
Generate a synthetic dataset for classification uplift modeling problem.
Parameters:
n_samples (int, optional (default=1000)) – The number of samples to be generated for each treatment group.
treatment_name (list, optional (default = ['control','treatment1','treatment2','treatment3'])) – The list of treatment names.
y_name (string, optional (default = 'conversion')) – The name of the outcome variable to be used as a column in the output dataframe.
n_classification_features (int, optional (default = 10)) – Total number of features for base classification
n_classification_informative (int, optional (default = 5)) – Total number of informative features for base classification
n_classification_redundant (int, optional (default = 0)) – Total number of redundant features for base classification
n_classification_repeated (int, optional (default = 0)) – Total number of repeated features for base classification
n_uplift_increase_dict (dictionary, optional (default: {'treatment1': 2, 'treatment2': 2, 'treatment3': 2})) – Number of features for generating positive treatment effects for corresponding treatment group.
Dictionary of {treatment_key: number_of_features_for_increase_uplift}.
n_uplift_decrease_dict (dictionary, optional (default: {'treatment1': 0, 'treatment2': 0, 'treatment3': 0})) – Number of features for generating negative treatment effects for corresponding treatment group.
Dictionary of {treatment_key: number_of_features_for_decrease_uplift}.
delta_uplift_increase_dict (dictionary, optional (default: {'treatment1': .02, 'treatment2': .05, 'treatment3': .1})) – Positive treatment effect created by the positive uplift features on the base classification label.
Dictionary of {treatment_key: increase_delta}.
delta_uplift_decrease_dict (dictionary, optional (default: {'treatment1': 0., 'treatment2': 0., 'treatment3': 0.})) – Negative treatment effect created by the negative uplift features on the base classification label.
Dictionary of {treatment_key: decrease_delta}.
n_uplift_increase_mix_informative_dict (dictionary, optional) – Number of positive mix features for each treatment. The positive mix feature is defined as a linear combination
of a randomly selected informative classification feature and a randomly selected positive uplift feature.
The linear combination is made by two coefficients sampled from a uniform distribution between -1 and 1.
default: {‘treatment1’: 1, ‘treatment2’: 1, ‘treatment3’: 1}
n_uplift_decrease_mix_informative_dict (dictionary, optional) – Number of negative mix features for each treatment. The negative mix feature is defined as a linear combination
of a randomly selected informative classification feature and a randomly selected negative uplift feature. The
linear combination is made by two coefficients sampled from a uniform distribution between -1 and 1.
default: {‘treatment1’: 0, ‘treatment2’: 0, ‘treatment3’: 0}
positive_class_proportion (float, optional (default = 0.5)) – The proportion of positive label (1) in the control group.
random_seed (int, optional (default = 20190101)) – The random seed to be used in the data generation process.
Returns:
df_res (DataFrame) – A data frame containing the treatment label, features, and outcome variable.
x_name (list) – The list of feature names generated.
Notes
The algorithm for generating the base classification dataset is adapted from the make_classification method in the
sklearn package, which uses the algorithm in Guyon [1] designed to generate the “Madelon” dataset.
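A minimal usage sketch; the returned data frame contains a treatment_group_key column, the generated features, and the outcome column:

from causalml.dataset import make_uplift_classification

df, x_names = make_uplift_classification(n_samples=1000, random_seed=20190101)
print(df['treatment_group_key'].value_counts())
print(x_names[:5])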
Generate a synthetic dataset for classification uplift modeling problem.
Parameters:
n_samples (int, optional (default=1000)) – The number of samples to be generated for each treatment group.
treatment_name (list, optional (default = ['control','treatment1','treatment2','treatment3'])) – The list of treatment names. The first element must be ‘control’ as control group, and the rest are treated as
treatment groups.
y_name (string, optional (default = 'conversion')) – The name of the outcome variable to be used as a column in the output dataframe.
n_classification_features (int, optional (default = 10)) – Total number of features for base classification
n_classification_informative (int, optional (default = 5)) – Total number of informative features for base classification
n_classification_redundant (int, optional (default = 0)) – Total number of redundant features for base classification
n_classification_repeated (int, optional (default = 0)) – Total number of repeated features for base classification
n_uplift_dict (dictionary, optional (default: {'treatment1': 2, 'treatment2': 2, 'treatment3': 3})) – Number of features for generating heterogeneous treatment effects for corresponding treatment group.
Dictionary of {treatment_key: number_of_features_for_uplift}.
n_mix_informative_uplift_dict (dictionary, optional (default: {'treatment1': 1, 'treatment2': 1, 'treatment3': 1})) – Number of mix features for each treatment. The mix feature is defined as a linear combination
of a randomly selected informative classification feature and a randomly selected uplift feature.
The mixture is made by a weighted sum (p*feature1 + (1-p)*feature2), where the weight p is drawn from a uniform
distribution between 0 and 1.
delta_uplift_dict (dictionary, optional (default: {'treatment1': .02, 'treatment2': .05, 'treatment3': -.05})) – Treatment effect (delta), can be positive or negative.
Dictionary of {treatment_key: delta}.
positive_class_proportion (float, optional (default = 0.1)) – The proportion of positive label (1) in the control group, or the mean of outcome variable for control group.
random_seed (int, optional (default = 20200101)) – The random seed to be used in the data generation process.
feature_association_list (list, optional (default = ['linear','quadratic','cubic','relu','sin','cos'])) – List of uplift feature association patterns to the treatment effect. For example, if the feature pattern is
‘quadratic’, then the treatment effect will increase or decrease quadratically with the feature.
The values in the list must be one of (‘linear’,’quadratic’,’cubic’,’relu’,’sin’,’cos’). However, the same
value can appear multiple times in the list.
random_select_association (boolean, optional (default = True)) – How the feature patterns are selected from the feature_association_list to be applied in the data generation
process. If random_select_association = True, then for every uplift feature, a random feature association
pattern is selected from the list. If random_select_association = False, then the feature association pattern
is selected from the list in turns to be applied to each feature one by one.
error_std (float, optional (default = 0.05)) – Standard deviation to be used in the error term of the logistic regression. The error is drawn from a normal
distribution with mean 0 and standard deviation specified in this argument.
Returns:
df1 (DataFrame) – A data frame containing the treatment label, features, and outcome variable.
x_name (list) – The list of feature names generated.
Synthetic data in Nie X. and Wager S. (2018) ‘Quasi-Oracle Estimation of Heterogeneous Treatment Effects’
Parameters:
mode (int, optional) – mode of the simulation: 1 for difficult nuisance components and an easy treatment effect; 2 for a randomized
trial; 3 for an easy propensity and a difficult baseline; 4 for unrelated treatment and control groups; 5 for a hidden
confounder biasing treatment.
n (int, optional) – number of observations
p (int, optional) – number of covariates (>=5)
sigma (float) – standard deviation of the error term
adj (float, optional) – adjustment term for the distribution of propensity, e. Higher values shift the distribution to 0.
It does not apply to mode == 2 or 3.
Returns:
Synthetically generated samples with the following outputs:
y ((n,)-array): outcome variable.
X ((n,p)-ndarray): independent variables.
w ((n,)-array): treatment flag with value 0 or 1.
tau ((n,)-array): individual treatment effect.
b ((n,)-array): expected outcome.
e ((n,)-array): propensity of receiving treatment.
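A minimal usage sketch, assuming this block documents causalml.dataset.synthetic_data:

from causalml.dataset import synthetic_data

y, X, w, tau, b, e = synthetic_data(mode=1, n=1000, p=5, sigma=1.0)
print(X.shape, y.shape, w.mean(), tau.mean())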
A Sensitivity Check class to support Placebo Treatment, Irrelevant Additional Confounder
and Subset validation refutation methods to verify causal inference.
method (list of str) – a list of sensitivity analysis method
sample_size (float, optional) – ratio for subset the original data
confound (string, optional) – the name of the confounding function
alpha_range (np.array, optional) – a parameter to pass the confounding function
Returns:
a feature matrix
p (np.array): a propensity score vector between 0 and 1
treatment (np.array): a treatment vector (1 if treated, otherwise 0)
y (np.array): an outcome vector
Check the partial R-squared value of a feature against the corresponding confounding amount of the ATE.
Parameters:
sens_df (pandas.DataFrame) – a data frame output from causalsens
feature_name (str) – feature name to check
partial_rsqs_value (float) – partial R-squared value of the feature
range (float) – range to search from sens_df
Plot the results of a sensitivity analysis against unmeasured confounders.
Parameters:
sens_df (pandas.DataFrame) – a data frame output from causalsens
partial_rsqs_d (pandas.DataFrame) – a data frame output from causalsens including partial R-squared values
type (str, optional) – the type of plot to draw; ‘raw’ or ‘r.squared’ are supported
ci (bool, optional) – whether to plot confidence intervals
partial_rsqs (bool, optional) – whether to plot partial R-squared results
Calculate the AUUC (Area Under the Uplift Curve) score.
Args:
df (pandas.DataFrame): a data frame with model estimates and actual data as columns
outcome_col (str, optional): the column name for the actual outcome
treatment_col (str, optional): the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional): the column name for the true treatment effect
normalize (bool, optional): whether to normalize the y-axis to 1 or not
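A hedged sketch, assuming causalml.metrics.auuc_score and reusing y, w, and tau from the synthetic_data sketch above; the true effect tau stands in for a model's CATE estimates:

import pandas as pd
from causalml.metrics import auuc_score

df = pd.DataFrame({'tau_model': tau, 'y': y, 'w': w})
print(auuc_score(df, outcome_col='y', treatment_col='w', normalize=True))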
w (numpy.array, optional) – a treatment vector (1 or True: treatment, 0 or False: control). If given, log
metrics for the treatment and control group separately
metrics (dict, optional) – a dictionary of the metric names and functions
Get cumulative gains of model estimates in population.
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the cumulative gain of the true treatment effect in each population.
Otherwise, it’s calculated as the cumulative difference between the mean outcomes
of the treatment and control groups in each population.
For details, see Section 4.1 of Gutierrez and Gérardy (2016), Causal Inference
and Uplift Modeling: A review of the literature.
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
normalize (bool, optional) – whether to normalize the y-axis to 1 or not
random_seed (int, optional) – random seed for numpy.random.rand()
Get average uplifts of model estimates in cumulative population.
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the mean of the true treatment effect in each of cumulative population.
Otherwise, it’s calculated as the difference between the mean outcomes of the
treatment and control groups in each of cumulative population.
For details, see Section 4.1 of Gutierrez and Gérardy (2016), Causal Inference
and Uplift Modeling: A review of the literature.
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
random_seed (int, optional) – random seed for numpy.random.rand()
Returns:
average uplifts of model estimates in cumulative population
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the cumulative gain of the true treatment effect in each population.
Otherwise, it’s calculated as the cumulative difference between the mean outcomes
of the treatment and control groups in each population.
For details, see Radcliffe (2007), Using Control Group to Target on Predicted Lift:
Building and Assessing Uplift Models
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
normalize (bool, optional) – whether to normalize the y-axis to 1 or not
random_seed (int, optional) – random seed for numpy.random.rand()
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –
Defines aggregating of multiple output values.
Array-like value defines weights used to average errors.
’raw_values’ :
Returns a full set of errors in case of multioutput input.
’uniform_average’ :
Errors of all outputs are averaged with uniform weight.
Returns:
loss – If multioutput is ‘raw_values’, then mean absolute error is returned
for each output separately.
If multioutput is ‘uniform_average’ or an ndarray of weights, then the
weighted average of all output errors is returned.
MAE output is non-negative floating point. The best value is 0.0.
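A minimal example of the mean absolute error call described above (this is the scikit-learn mean_absolute_error interface):

from sklearn.metrics import mean_absolute_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mean_absolute_error(y_true, y_pred))   # 0.5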
Plot the cumulative gain chart (or uplift curve) of model estimates.
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the cumulative gain of the true treatment effect in each population.
Otherwise, it’s calculated as the cumulative difference between the mean outcomes
of the treatment and control groups in each population.
For details, see Section 4.1 of Gutierrez and Gérardy (2016), Causal Inference
and Uplift Modeling: A review of the literature.
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
normalize (bool, optional) – whether to normalize the y-axis to 1 or not
random_seed (int, optional) – random seed for numpy.random.rand()
n (int, optional) – the number of samples to be used for plotting
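A hedged sketch, assuming causalml.metrics.plot_gain and the same data frame layout as in the AUUC sketch above (model estimates next to the actual outcome and treatment columns):

from causalml.metrics import plot_gain

plot_gain(df, outcome_col='y', treatment_col='w', normalize=True, n=100)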
Plot the lift chart of model estimates in cumulative population.
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the mean of the true treatment effect in each of cumulative population.
Otherwise, it’s calculated as the difference between the mean outcomes of the
treatment and control groups in each of cumulative population.
For details, see Section 4.1 of Gutierrez and Gérardy (2016), Causal Inference
and Uplift Modeling: A review of the literature.
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
random_seed (int, optional) – random seed for numpy.random.rand()
n (int, optional) – the number of samples to be used for plotting
Plot the Qini chart (or uplift curve) of model estimates.
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the cumulative gain of the true treatment effect in each population.
Otherwise, it’s calculated as the cumulative difference between the mean outcomes
of the treatment and control groups in each population.
For details, see Radcliffe (2007), Using Control Group to Target on Predicted Lift:
Building and Assessing Uplift Models
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
normalize (bool, optional) – whether to normalize the y-axis to 1 or not
random_seed (int, optional) – random seed for numpy.random.rand()
n (int, optional) – the number of samples to be used for plotting
ci (bool, optional) – whether return confidence intervals for ATE or not
Calculate the Qini score: the area between the Qini curves of a model and random.
For details, see Radcliffe (2007), Using Control Group to Target on Predicted Lift:
Building and Assessing Uplift Models
Args:
df (pandas.DataFrame): a data frame with model estimates and actual data as columns
outcome_col (str, optional): the column name for the actual outcome
treatment_col (str, optional): the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional): the column name for the true treatment effect
normalize (bool, optional): whether to normalize the y-axis to 1 or not
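A hedged sketch mirroring the AUUC example above:

from causalml.metrics import qini_score

print(qini_score(df, outcome_col='y', treatment_col='w', normalize=True))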
\(R^2\) (coefficient of determination) regression score function.
Best possible score is 1.0 and it can be negative (because the
model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features,
would get a \(R^2\) score of 0.0.
Read more in the User Guide.
Parameters:
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average', 'variance_weighted'}, array-like of shape (n_outputs,) or None, default='uniform_average') –
Defines aggregating of multiple output scores.
Array-like value defines weights used to average scores.
Default is “uniform_average”.
’raw_values’ :
Returns a full set of scores in case of multioutput input.
’uniform_average’ :
Scores of all outputs are averaged with uniform weight.
’variance_weighted’ :
Scores of all outputs are averaged, weighted by the variances
of each individual output.
Changed in version 0.19: Default value of multioutput is ‘uniform_average’.
Returns:
z – The \(R^2\) score or ndarray of scores if ‘multioutput’ is
‘raw_values’.
Return type:
float or ndarray of floats
Notes
This is not a symmetric function.
Unlike most other scores, \(R^2\) score may be negative (it need not
actually be the square of a quantity R).
This metric is not well-defined for single samples and will return a NaN
value if n_samples is less than two.
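A minimal example of the scikit-learn r2_score interface described above:

from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(r2_score(y_true, y_pred))   # ~0.949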
w (numpy.array, optional) – a treatment vector (1 or True: treatment, 0 or False: control). If given, log
metrics for the treatment and control group separately
metrics (dict, optional) – a dictionary of the metric names and functions
Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC)
from prediction scores.
Note: this implementation can be used with binary, multiclass and
multilabel classification, but some restrictions apply (see Parameters).
Read more in the User Guide.
Parameters:
y_true (array-like of shape (n_samples,) or (n_samples, n_classes)) – True labels or binary label indicators. The binary and multiclass cases
expect labels with shape (n_samples,) while the multilabel case expects
binary label indicators with shape (n_samples, n_classes).
y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) –
Target scores.
In the binary case, it corresponds to an array of shape
(n_samples,). Both probability estimates and non-thresholded
decision values can be provided. The probability estimates correspond
to the probability of the class with the greater label,
i.e. estimator.classes_[1] and thus
estimator.predict_proba(X)[:, 1]. The decision values
correspond to the output of estimator.decision_function(X).
See more information in the User guide;
In the multiclass case, it corresponds to an array of shape
(n_samples, n_classes) of probability estimates provided by the
predict_proba method. The probability estimates must
sum to 1 across the possible classes. In addition, the order of the
class scores must correspond to the order of labels,
if provided, or else to the numerical or lexicographical order of
the labels in y_true. See more information in the
User guide;
In the multilabel case, it corresponds to an array of shape
(n_samples, n_classes). Probability estimates are provided by the
predict_proba method and the non-thresholded decision values by
the decision_function method. The probability estimates correspond
to the probability of the class with the greater label for each
output of the classifier. See more information in the
User guide.
average ({'micro', 'macro', 'samples', 'weighted'} or None, default='macro') –
If None, the scores for each class are returned. Otherwise,
this determines the type of averaging performed on the data:
Note: multiclass ROC AUC currently only handles the ‘macro’ and
‘weighted’ averages.
'micro':
Calculate metrics globally by considering each element of the label
indicator matrix as a label.
'macro':
Calculate metrics for each label, and find their unweighted
mean. This does not take label imbalance into account.
'weighted':
Calculate metrics for each label, and find their average, weighted
by support (the number of true instances for each label).
'samples':
Calculate metrics for each instance, and find their average.
Will be ignored when y_true is binary.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
max_fpr (float > 0 and <= 1, default=None) – If not None, the standardized partial AUC [2] over the range
[0, max_fpr] is returned. For the multiclass case, max_fpr
should be either equal to None or 1.0, as partial AUC ROC
computation is currently not supported for multiclass.
multi_class ({'raise', 'ovr', 'ovo'}, default='raise') – Only used for multiclass targets. Determines the type of configuration
to use. The default value raises an error, so either
'ovr' or 'ovo' must be passed explicitly.
'ovr':
Stands for One-vs-rest. Computes the AUC of each class
against the rest [3][4]. This
treats the multiclass case in the same way as the multilabel case.
Sensitive to class imbalance even when average=='macro',
because class imbalance affects the composition of each of the
‘rest’ groupings.
'ovo':
Stands for One-vs-one. Computes the average AUC of all
possible pairwise combinations of classes [5].
Insensitive to class imbalance when
average=='macro'.
labels (array-like of shape (n_classes,), default=None) – Only used for multiclass targets. List of labels that index the
classes in y_score. If None, the numerical or lexicographical
order of the labels in y_true is used.
>>> import numpy as np
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.multioutput import MultiOutputClassifier
>>> X, y = make_multilabel_classification(random_state=0)
>>> # base estimator (defined in the binary example of the original docs)
>>> clf = MultiOutputClassifier(LogisticRegression(solver="liblinear", random_state=0)).fit(X, y)
>>> # get a list of n_output containing probability arrays of shape
>>> # (n_samples, n_classes)
>>> y_pred = clf.predict_proba(X)
>>> # extract the positive columns for each output
>>> y_pred = np.transpose([pred[:, 1] for pred in y_pred])
>>> roc_auc_score(y, y_pred, average=None)
array([0.82..., 0.86..., 0.94..., 0.85..., 0.94...])
>>> from sklearn.linear_model import RidgeClassifierCV
>>> clf = RidgeClassifierCV().fit(X, y)
>>> roc_auc_score(y, clf.decision_function(X), average=None)
array([0.81..., 0.84..., 0.93..., 0.87..., 0.94...])
Rank features based on the chosen divergence measure.
Parameters:
data (pd.Dataframe) – DataFrame containing outcome, features, and experiment group
treatment_indicator (string) – the column name for binary indicator of treatment (1) or control (0)
features (list of string) – list of feature names that are columns in the data DataFrame
y_name (string) – name of the outcome variable
method (string, optional, default = 'KL') – taking one of the following values {‘F’, ‘LR’, ‘KL’, ‘ED’, ‘Chi’}
The feature selection method to be used to rank the features.
‘F’ for F-test
‘LR’ for likelihood ratio test
‘KL’, ‘ED’, ‘Chi’ for bin-based uplift filter methods, KL divergence, Euclidean distance, Chi-Square
respectively
experiment_group_column (string, optional, default = 'treatment_group_key') – the experiment column name in
the DataFrame, which contains the treatment and control assignment label
control_group (string, optional, default = 'control') – name for control group, value in the experiment
group column
n_bins (int, optional, default = 10) – number of bins to be used for bin-based uplift filter methods
null_impute (str, optional, default=None) – impute np.nan present in the data taking one of the following
strategy values {'mean', 'median', 'most_frequent', None}. If the value is None and nulls are present, an
exception will be raised
Returns:
pd.DataFrame
a data frame containing the feature importance statistics
Rank features based on the F-statistics of the interaction.
Parameters:
data (pd.Dataframe) – DataFrame containing outcome, features, and experiment group
treatment_indicator (string) – the column name for binary indicator of treatment (1) or control (0)
features (list of string) – list of feature names that are columns in the data DataFrame
y_name (string) – name of the outcome variable
order (int) – the order of feature to be evaluated with the treatment effect; order takes 3 values: 1, 2, 3.
order = 1 corresponds to linear importance of the feature, order = 2 corresponds to quadratic and linear
importance of the feature, and order = 3 will calculate feature importance up to cubic forms.
Returns:
pd.DataFrame
a data frame containing the feature importance statistics
Rank features based on the LRT-statistics of the interaction.
Parameters:
data (pd.Dataframe) – DataFrame containing outcome, features, and experiment group
treatment_indicator (string) – the column name for binary indicator of treatment (1) or control (0)
feature_name (string) – feature name, as one column in the data DataFrame
y_name (string) – name of the outcome variable
order (int) – the order of feature to be evaluated with the treatment effect; order takes 3 values: 1, 2, 3.
order = 1 corresponds to linear importance of the feature, order = 2 corresponds to quadratic and linear
importance of the feature, and order = 3 will calculate feature importance up to cubic forms.
Returns:
pd.DataFrame
a data frame containing the feature importance statistics
Rank features based on the chosen statistic of the interaction.
Parameters:
data (pd.Dataframe) – DataFrame containing outcome, features, and experiment group
features (list of string) – list of feature names that are columns in the data DataFrame
y_name (string) – name of the outcome variable
method (string, optional, default = 'KL') – taking one of the following values {‘F’, ‘LR’, ‘KL’, ‘ED’, ‘Chi’}
The feature selection method to be used to rank the features.
‘F’ for F-test
‘LR’ for likelihood ratio test
‘KL’, ‘ED’, ‘Chi’ for bin-based uplift filter methods, KL divergence, Euclidean distance, Chi-Square
respectively
experiment_group_column (string) – the experiment column name in the DataFrame, which contains the treatment
and control assignment label
control_group (string) – name for control group, value in the experiment group column
treatment_group (string) – name for treatment group, value in the experiment group column
n_bins (int, optional) – number of bins to be used for bin-based uplift filter methods
null_impute (str, optional, default=None) – impute np.nan present in the data taking one of the following
strategy values {'mean', 'median', 'most_frequent', None}. If the value is None and nulls are present, an
exception will be raised
order (int) – the order of feature to be evaluated with the treatment effect for the F filter and LR filter;
order takes 3 values: 1, 2, 3. order = 1 corresponds to linear importance of the feature, order = 2
corresponds to quadratic and linear importance of the feature, and order = 3 will calculate feature
importance up to cubic forms.
disp (bool) – Set to True to print convergence messages for Logistic regression convergence in LR method.
Returns:
pd.DataFrame
a data frame with the following columns: [‘method’, ‘feature’, ‘rank’, ‘score’, ‘p_value’, ‘misc’]
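As a hedged end-to-end sketch of the filter methods documented above, assuming they are exposed through a FilterSelect class with a get_importance entry point (the import path, argument order, and defaults may differ between versions):

import numpy as np
import pandas as pd
from causalml.feature_selection import FilterSelect  # assumed path; may live under causalml.feature_selection.filters

rng = np.random.default_rng(0)
n = 2000

# Toy data: two numeric features, a treatment/control label, and a binary outcome
# whose uplift depends on feature_1 only.
df = pd.DataFrame({
    "feature_1": rng.normal(size=n),
    "feature_2": rng.normal(size=n),
    "treatment_group_key": rng.choice(["control", "treatment"], size=n),
})
uplift = 0.1 * (df["feature_1"] > 0) * (df["treatment_group_key"] == "treatment")
df["conversion"] = rng.binomial(1, 0.2 + uplift)

fs = FilterSelect()

# F-test on the feature x treatment interaction (order=1: linear importance).
f_imp = fs.get_importance(
    df, ["feature_1", "feature_2"], "conversion", "F",
    experiment_group_column="treatment_group_key",
    control_group="control", treatment_group="treatment", order=1,
)

# Bin-based KL-divergence uplift filter with 10 bins.
kl_imp = fs.get_importance(
    df, ["feature_1", "feature_2"], "conversion", "KL",
    experiment_group_column="treatment_group_key",
    control_group="control", treatment_group="treatment", n_bins=10,
)

print(f_imp[["method", "feature", "rank", "score", "p_value"]])
print(kl_imp[["method", "feature", "rank", "score"]])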
Ahmed Alaa and Mihaela Schaar. Limits of estimating heterogeneous treatment effects: guidelines for practical algorithm design. In International Conference on Machine Learning, 129–138. 2018.
Joshua D Angrist and Jörn-Steffen Pischke. Mostly harmless econometrics: An empiricist's companion. Princeton University Press, 2008.
Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
Pierre Gutierrez and Jean-Yves Gerardy. Causal inference and uplift modeling: a review of the literature. JMLR: Workshop and Conference Proceedings 67, 2016.
Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: a flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 1414–1423. JMLR.org, 2017.
Guido W Imbens and Jeffrey M Wooldridge. Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1):5–86, 2009.
Edward H. Kennedy. Optimal doubly robust estimation of heterogeneous causal effects. 2020. arXiv:2004.14497.
Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.
Mark Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer-Verlag New York, 2011. ISBN 978-1-4419-9781-4. doi:10.1007/978-1-4419-9782-1.
Ang Li and Judea Pearl. Unit selection based on counterfactual logic. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 1793–1799. International Joint Conferences on Artificial Intelligence Organization, 2019. doi:10.24963/ijcai.2019/248.
Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.
Miruna Oprescu, Vasilis Syrgkanis, and Zhiwei Steven Wu. Orthogonal random forest for heterogeneous treatment effect estimation. CoRR, 2018. arXiv:1806.03467.
Judea Pearl. Causality. Cambridge University Press, 2009.
Piotr Rzepakowski and Szymon Jaroszewicz. Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems, 32(2):303–327, August 2012.
Jannik Rößler, Richard Guse, and Detlef Schoder. The best of two worlds: using recent advances from uplift modeling and heterogeneous treatment effects to optimize targeting policies. International Conference on Information Systems, 2022.
Elizabeth A Stuart. Matching methods for causal inference: a review and a look forward. Statistical Science, 25(1):1, 2010.
Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A Levine, and Xin Yan. Facilitating score and causal inference trees for large observational studies. Journal of Machine Learning Research, 13:2955, 2012.
Xiaogang Su, Chih-Ling Tsai, Hansheng Wang, David M Nickerson, and Bogong Li. Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 2009.
Jin Tian and Judea Pearl. Probabilities of causation: bounds and identification. Annals of Mathematics and Artificial Intelligence, 28(1):287–313, 2000.
Yan Zhao, Xiao Fang, and David Simchi-Levi. Uplift modeling with multiple treatments and general response types. In Proceedings of the 2017 SIAM International Conference on Data Mining, 588–596. SIAM, 2017.
Zhenyu Zhao and Totte Harinen. Uplift modeling for multiple treatments with cost optimization. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 422–431. IEEE, 2019.
P. Richard Hahn, Jared S. Murray, and Carlos Carvalho. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. arXiv:1706.09523, June 2017.
In this release, we revamped documentation, cleaned up dependencies, and improved installation - in addition to the long list of bug fixes.
We have three new contributors, @peterloleungyau, @SuperBo, and @ZiJiaW, who submitted their first PRs to CausalML. @erikcs also contributed to @ras44’s PR #729 to add the wrapper for his MAQ implementation to CausalML. Thanks for your contributions!
CausalML surpassed 1MM downloads on PyPI and 3,200 stars on GitHub. Thanks for choosing CausalML and supporting us on GitHub.
We have 7 new contributors @saiwing-yeung, @lixuan12315, @aldenrogers, @vincewu51, @AlkanSte, @enzoliao, and @alexander-pv. Thanks for your contributions!
@alexander-pv revamped CausalTreeRegressor and added CausalRandomForestRegressor with more seamless integration with scikit-learn’s Cython tree module. He also added integration with shap for causal tree/random forest interpretation. Please check out the example notebook.
We dropped the support for Python 3.6 and removed its test workflow.
This patch includes three updates by @tonkolviktor and @heiderich as follows. We also started using black, a Python formatter. Please check out the updated contribution guideline to learn how to use it.
We refactored and sped up UpliftTreeClassifier/UpliftRandomForestClassifier by 5x with Cython (#422, #440 by @jeongyoonlee)
We revamped our API documentation; it now includes the latest methodology, references, installation, notebook examples, and graphs! (#413 by @huigangchen @t-tte @zhenyuz0500 @jeongyoonlee @paullo0106)
Add propensity_learner to R-learner by @jeongyoonlee (#297)
Add BaseLearner class for other meta-learners to inherit from without duplicated code by @jeongyoonlee (#295)
Fix installation issue for Shap>=0.38.1 by @paullo0106 (#287)
Fix import error for sklearn>= 0.24 by @jeongyoonlee (#283)
Fix KeyError issue in Filter method for certain dataset by @surajiyer (#281)
Fix inconsistent cumlift score calculation of multiple models by @vaclavbelak (#273)
Fix duplicate values handling in feature selection method by @manojbalaji1 (#271)
Fix the color spectrum of SHAP summary plot for feature interpretations of meta-learners by @paullo0106 (#269)
Add IIA and value optimization related documentation by @t-tte (#264)
Fix StratifiedKFold arguments for propensity score estimation by @paullo0106 (#262)
Refactor the code to use string format arguments and is for object-type comparison, and change methods that don't use the bound instance to static methods by @harshcasper (#256, #260)
CausalML team welcomes new project leadership, Mert Bay
We have 4 new community contributors, Mario Wijaya (@mwijaya3), Harry Zhao (@deeplaunch), Christophe (@ccrndn) and Georg Walther (@waltherg). Thanks for the contribution!