Alibi is an open source Python library aimed at machine learning model inspection and interpretation. The initial focus on the library is on blackbox, instance based model explanations.
Goals¶
Provide high quality reference implementations of blackbox ML model explanation and interpretation algorithms
Define a consistent API for interpretable ML methods
Support multiple use cases (e.g. tabular, text and image data classification, regression)
Getting Started¶
Features¶
Alibi is a Python package designed to help explain the predictions of machine learning models, gauge the confidence of predictions and eventually support wider capabilities of inspecting the performance of models with respect to concept drift and algorithmic bias. The focus of the library is to support the widest range of models using blackbox methods where possible.
To get a list of the latest available model explanation algorithms, you can type:
import alibi
alibi.explainers.__all__
['AnchorTabular',
'AnchorText',
'AnchorImage',
'CEM',
'CounterFactual',
'CounterFactualProto']
For gauging model confidence:
alibi.confidence.__all__
['linearity_measure',
'LinearityMeasure',
'TrustScore']
For detailed information on the methods:
Basic Usage¶
We will use the Anchor method on tabular data to illustrate the usage of explainers in Alibi.
First, we import the explainer:
from alibi.explainers import AnchorTabular
Next, we initialize it by passing it a prediction function and any other necessary arguments:
explainer = AnchorTabular(predict_fn, feature_names)
Some methods require an additional .fit
step which requires access to the training set the model
was trained on:
explainer.fit(X_train)
Finally, we can call the explainer on a test instance which will return a dictionary containing the explanation and any additional metadata returned by the computation:
explainer.explain(x)
The exact details will vary slightly from method to method, so we encourage the reader to become familiar with the types of algorithms supported in Alibi.
Algorithm overview¶
This page provides a highlevel overview of the algorithms and their features currently implemented in Alibi.
Model Explanations¶
These algorithms provide instancespecific (sometimes also called local) explanations of ML model predictions. Given a single instance and a model prediction they aim to answer the question “Why did my model make this prediction?” The following algorithms all work with blackbox models meaning that the only requirement is to have acces to a prediction function (which could be an API endpoint for a model in production).
The following table summarizes the capabilities of the current algorithms:
Explainer 
Model types 
Classification 
Categorical data 
Tabular 
Text 
Images 
Need training set 

blackbox 
✔ 
✔ 
✔ 
✔ 
✔ 
For Tabular 

blackbox, TF/Keras 
✔ 
✘ 
✔ 
✘ 
✔ 
Optional 

blackbox, TF/Keras 
✔ 
✘ 
✔ 
✘ 
✔ 
No 

blackbox, TF/Keras 
✔ 
✔ 
✔ 
✘ 
✔ 
Optional 
Anchor explanations: produce an “anchor”  a small subset of features and their ranges that will almost always result in the same model prediction. Documentation, tabular example, text classification, image classification.
Contrastive explanation method (CEM): produce a pertinent positive (PP) and a pertinent negative (PN) instance. The PP instance finds the features that should me minimally and sufficiently present to predict the same class as the original prediction (a PP acts as the “most compact” representation of the instance to keep the same prediction). The PN instance identifies the features that should be minimally and necessarily absent to maintain the original prediction (a PN acts as the closest instance that would result in a different prediction). Documentation, tabular example, image classification.
Counterfactual instances: generate counterfactual examples using a simple loss function. Documentation, image classification.
Prototype Counterfactuals: generate counterfactuals guided by nearest class prototypes other than the class predicted on the original instance. It can use both an encoder or kd trees to define the prototypes. This method can speed up the search, especially for black box models, and create interpretable counterfactuals. Documentation, tabular example, tabular example with categorical features, image classification.
Model Confidence¶
These algorihtms provide instancespecific scores measuring the model confidence for making a particular prediction.
Algorithm 
Model types 
Classification 
Regression 
Categorical data 
Tabular 
Text 
Images 
Need training set 

blackbox 
✔ 
✘ 
✘ 
✔ 
✔1 
✔2 
Yes 

blackbox 
✔ 
✔ 
✘ 
✔ 
✘ 
✔ 
Optional 
Trust scores: produce a “trust score” of a classifier’s prediction. The trust score is the ratio between the distance to the nearest class different from the predicted class and the distance to the predicted class, higher scores correspond to more trustworthy predictions. Documentation, tabular example, image classification
Linearity measure: produces a score quantifying how linear the model is around a test instance. The linearity score measures the model linearity around a test instance by feeding the model linear superpositions of inputs and comparing the outputs with the linear combination of outputs from predictions on single inputs. Tabular example, image classification
Roadmap¶
Alibi aims to be the goto library for ML model interpretability and monitoring. There are multiple challenges for developing a highquality, productionready library that achieves this. In addition to having high quality reference implementations of the most promising algorithms, we need extensive documentation and case studies comparing the different interpretability methods and their respective pros and cons. A clean and a usable API is also a priority.
Short term¶
Ongoing optimizations of existing algorithms (speed, parallelisation, explanation quality)
Finalize a unified API (Github PR)
Initial visualizations and visualization backends (Github issue)
Medium term¶
Migrate counterfactual methods to TensorFlow 2.0 (Github issue)
Additional blackbox explanation methods (ALE)
Additional model confidence/calibration methods
Provide integration with 3rd party explanation libraries (e.g. SHAP)
Whitebox explanation methods (e.g. Integrated Gradients)
Support both TensorFlow and PyTorch for whitebox methods
Long term¶
Explanations for regression models
Explanations for sequential and structured data
This page was generated from doc/source/methods/Anchors.ipynb.
Anchors¶
Overview¶
The anchor algorithm is based on the Anchors: HighPrecision ModelAgnostic Explanations paper by Ribeiro et al. and builds on the open source code from the paper’s first author.
The algorithm provides modelagnostic (black box) and human interpretable explanations suitable for classification models applied to images, text and tabular data. The idea behind anchors is to explain the behaviour of complex models with highprecision rules called anchors. These anchors are locally sufficient conditions to ensure a certain prediction with a high degree of confidence.
Anchors address a key shortcoming of local explanation methods like LIME which proxy the local behaviour of the model in a linear way. It is however unclear to what extent the explanation holds up in the region around the instance to be explained, since both the model and data can exhibit nonlinear behaviour in the neighborhood of the instance. This approach can easily lead to overconfidence in the explanation and misleading conclusions on unseen but similar instances. The anchor algorithm tackles this issue by incorporating coverage, the region where the explanation applies, into the optimization problem. A simple example from sentiment classification illustrates this (Figure 1). Dependent on the sentence, the occurrence of the word not is interpreted as positive or negative for the sentiment by LIME. It is clear that the explanation using not is very local. Anchors however aim to maximize the coverage, and require not to occur together with good or bad to ensure respectively negative or positive sentiment.
Ribeiro et al., Anchors: HighPrecision ModelAgnostic Explanations, 2018
As highlighted by the above example, an anchor explanation consists of ifthen rules, called the anchors, which sufficiently guarantee the explanation locally and try to maximize the area for which the explanation holds. This means that as long as the anchor holds, the prediction should remain the same regardless of the values of the features not present in the anchor. Going back to the sentiment example: as long as not good is present, the sentiment is negative, regardless of the other words in the movie review.
Text¶
For text classification, an interpretable anchor consists of the words that need to be present to ensure a prediction, regardless of the other words in the input. The words that are not present in a candidate anchor can be sampled in 2 ways:
Replace word token by UNK token.
Replace word token by sampled token from a corpus with the same POS tag and probability proportional to the similarity in the embedding space. By sampling similar words, we keep more context than simply using the UNK token.
Tabular Data¶
Anchors are also suitable for tabular data with both categorical and continuous features. The continuous features are discretized into quantiles (e.g. deciles), so they become more interpretable. The features in a candidate anchor are kept constant (same category or bin for discretized features) while we sample the other features from a training set. As a result, anchors for tabular data need access to training data. Let’s illustrate this with an example. Say we want to predict whether a person makes less or more than £50,000 per year based on the person’s characteristics including age (continuous variable) and marital status (categorical variable). The following would then be a potential anchor: Hugo makes more than £50,000 because he is married and his age is between 35 and 45 years.
Images¶
Similar to LIME, images are first segmented into superpixels, maintaining local image structure. The interpretable representation then consists of the presence or absence of each superpixel in the anchor. It is crucial to generate meaningful superpixels in order to arrive at interpretable explanations. The algorithm supports a number of standard image segmentation algorithms (felzenszwalb, slic and quickshift) and allows the user to provide a custom segmentation function.
The superpixels not present in a candidate anchor can be masked in 2 ways:
Take the average value of that superpixel.
Use the pixel values of a superimposed picture over the masked superpixels.
Ribeiro et al., Anchors: HighPrecision ModelAgnostic Explanations, 2018
Efficiently Computing Anchors¶
The anchor needs to return the same prediction as the original instance with a minimal confidence of e.g. 95%. If multiple candidate anchors satisfy this constraint, we go with the anchor that has the largest coverage. Because the number of potential anchors is exponential in the feature space, we need a faster approximate solution.
The anchors are constructed bottomup in combination with beam search. We start with an empty rule or anchor, and incrementally add an ifthen rule in each iteration until the minimal confidence constraint is satisfied. If multiple valid anchors are found, the one with the largest coverage is returned.
In order to select the best candidate anchors for the beam width efficiently during each iteration, we formulate the problem as a pure exploration multiarmed bandit problem. This limits the number of model prediction calls which can be a computational bottleneck.
For more details, we refer the reader to the original paper.
Usage¶
While each data type has specific requirements to initialize the explainer and return explanations, the underlying algorithm to construct the anchors is the same.
In order to efficiently generate anchors, the following hyperparameters need to be set to sensible values when calling the explain
method:
threshold
: the previously discussed minimal confidence level.threshold
defines the minimum fraction of samples for a candidate anchor that need to lead to the same prediction as the original instance. A higher value gives more confidence in the anchor, but also leads to more computation time. The default value is 0.95.tau
: determines when we assume convergence for the multiarmed bandit. A bigger value fortau
means faster convergence but also looser anchor conditions. By default equal to 0.15.beam_size
: the size of the beam width. A bigger beam width can lead to a better overall anchor at the expense of more computation time.batch_size
: the batch size used for sampling. A bigger batch size gives more confidence in the anchor, again at the expense of computation time since it involves more model prediction calls. The default value is 100.coverage_samples
: number of samples used to compute the coverage of the anchor. By default set to 10000.
Text¶
Initialization¶
Since the explainer works on black box models, only access to a predict function is needed. The model below is a simple logistic regression trained on movie reviews with negative or positive sentiment and preprocessed with a CountVectorizer:
predict_fn = lambda x: clf.predict(vectorizer.transform(x))
If we choose to sample similar words from a corpus, we first need to load a spaCy model:
import spacy
from alibi.utils.download import spacy_model
model = 'en_core_web_md'
spacy_model(model=model)
nlp = spacy.load(model)
We can now initialize our explainer:
explainer = AnchorText(nlp, predict_fn)
Explanation¶
Let’s define the instance we want to explain and verify that the sentiment prediction on the original instance is positive:
text = 'This is a good book .'
class_names = ['negative', 'positive']
pred = class_names[predict_fn([text])[0]]
Now we can explain the instance:
explanation = explainer.explain(text, threshold=0.95, use_similarity_proba=False,
use_unk=True, sample_proba=0.5)
We set the confidence threshold
at 95%. use_unk
equals True means that we replace words outside of the candidate anchor with UNK tokens with a sample probability equal to sample_proba
. Instead of using UNK tokens, we can sample from the top_n
similar words to the ground truth word in the corpus by setting use_unk
to False.
explanation = explainer.explain(text, threshold=0.95, use_unk=False, sample_proba=0.5, top_n=100)
It is also possible to sample words from the corpus proportional to the word similarity with the ground truth word by setting use_similarity_proba
to True and use_unk
to False. We can put more weight on similar words by decreasing the temperature
argument. The following explanation perturbs original tokens with probability equal to sample_proba
. The perturbed tokens are then sampled from the top_n
most similar tokens in the corpus with sample probability proportional to the
word similarity with the original token.
explanation = explainer.explain(text, threshold=0.95, use_similarity_proba=True, use_unk=False,
sample_proba=0.5, top_n=20, temperature=0.2)
The explain
method returns a dictionary that contains key: value pairs for:
names: the words in the anchor.
precision: the fraction of times the sampled instances where the anchor holds yields the same prediction as the original instance. The precision will always be \(\geq\)
threshold
for a valid anchor.coverage: the coverage of the anchor over a sampled part of the training set.
Under the raw key, the dictionary also contains example instances where the anchor holds and the prediction is the same as on the original instance, as well as examples where the anchor holds but the prediction changed to give the user a sense of where the anchor fails. raw also stores information on the names, precision and coverage of partial anchors. This allows the user to track the improvement in for instance the precision as more features (words in the case of text) are added to the anchor.
Tabular Data¶
Initialization and fit¶
To initialize the explainer, we provide a predict function, a list with the feature names to make the anchors easy to understand as well as an optional mapping from the encoded categorical features to a description of the category. An example for categorical_names
would be category_map = {0: list(‘married’, ‘divorced’), 3: list(‘high school diploma’, ‘master’s degree’)}. Each key in category_map refers to the column index in the input for the relevant categorical variable, while the
values are lists with the options for each categorical variable. To make it easy, we provide a utility function gen_category_map
to generate this map automatically from a Pandas dataframe:
from alibi.utils.data import gen_category_map
category_map = gen_category_map(df)
Then initialize the explainer:
predict_fn = lambda x: clf.predict(preprocessor.transform(x))
explainer = AnchorTabular(predict_fn, feature_names, categorical_names=category_map)
Tabular data requires a fit step to map the ordinal features into quantiles and therefore needs access to a representative set of the training data. disc_perc
is a list with percentiles used for binning:
explainer.fit(X_train, disc_perc=[25, 50, 75])
Explanation¶
Let’s check the prediction of the model on the original instance and explain:
class_names = ['<=50K', '>50K']
pred = class_names[explainer.predict_fn(X)[0]]
explanation = explainer.explain(X, threshold=0.95)
The returned explanation dictionary contains the same key: value pairs as the text explainer, so you could explain a prediction as follows:
Prediction: <=50K
Anchor: Marital Status = NeverMarried AND Relationship = Ownchild
Precision: 1.00
Coverage: 0.13
Images¶
Initialization¶
Besides the predict function, we also need to specify either a built in or custom superpixel segmentation function. The built in methods are felzenszwalb, slic and quickshift. It is important to create sensible superpixels in order to speed up convergence and generate interpretable explanations. Tuning the hyperparameters of the segmentation method is recommended.
explainer = AnchorImage(predict_fn, image_shape, segmentation_fn='slic',
segmentation_kwargs={'n_segments': 15, 'compactness': 20, 'sigma': .5},
images_background=None)
Example of superpixels generated for the Persian cat picture using the slic method:
The following function would be an example of a custom segmentation function dividing the image into rectangles.
def superpixel(image, size=(4, 7)):
segments = np.zeros([image.shape[0], image.shape[1]])
row_idx, col_idx = np.where(segments == 0)
for i, j in zip(row_idx, col_idx):
segments[i, j] = int((image.shape[1]/size[1]) * (i//size[0]) + j//size[1])
return segments
The images_background
parameter allows the user to provide images used to superimpose on the masked superpixels, not present in the candidate anchor, instead of taking the average value of the masked superpixel. The superimposed images need to have the same shape as the explained instance.
Explanation¶
We can then explain the instance in the usual way:
explanation = explainer.explain(image, p_sample=.5)
p_sample
determines the fraction of superpixels that are either changed to the average superpixel value or that are superimposed.
The explanation dictionary again contains information about the anchor’s precision, coverage and examples where the anchor does or does not hold. On top of that, it also contains a masked image with only the anchor superpixels visible under the anchor key (see image below) as well as the image’s superpixels under segments.
This page was generated from doc/source/methods/CEM.ipynb.
Contrastive Explanation Method¶
Overview¶
The Contrastive Explanation Method (CEM) is based on the paper Explanations based on the Missing: Towards Constrastive Explanations with Pertinent Negatives and extends the code open sourced by the authors. CEM generates instance based local black box explanations for classification models in terms of Pertinent Positives (PP) and Pertinent Negatives (PN). For a PP, the method finds the features that should be minimally and sufficiently present (e.g. important pixels in an image) to predict the same class as on the original instance. PN’s on the other hand identify what features should be minimally and necessarily absent from the instance to be explained in order to maintain the original prediction class. The aim of PN’s is not to provide a full set of characteristics that should be absent in the explained instance, but to provide a minimal set that differentiates it from the closest different class. Intuitively, the Pertinent Positives could be compared to Anchors while Pertinent Negatives are similar to Counterfactuals. As the authors of the paper state, CEM can generate clear explanations of the form: “An input x is classified in class y because features \(f_{i}\), …, \(f_{k}\) are present and because features \(f_{m}\), …, \(f_{p}\) are absent.” The current implementation is most suitable for images and tabular data without categorical features.
In order to create interpretable PP’s and PN’s, featurewise perturbation needs to be done in a meaningful way. To keep the perturbations sparse and close to the original instance, the objective function contains an elastic net (\(\beta\)\(L_{1}\) + \(L_{2}\)) regularizer. Optionally, an autoencoder can be trained to reconstruct instances of the training set. We can then introduce the \(L_{2}\) reconstruction error of the perturbed instance as an additional loss term in our objective function. As a result, the perturbed instance lies close to the training data manifold.
The ability to add or remove features to arrive at respectively PN’s or PP’s implies that there are feature values that contain no information with regards to the model’s predictions. Consider for instance the MNIST image below where the pixels are scaled between 0 and 1. The pixels with values close to 1 define the number in the image while the background pixels have value 0. We assume that perturbations towards the background value 0 are equivalent to removing features, while perturbations towards 1 imply adding features.
It is intuitive to understand that adding features to get a PN means changing 0’s into 1’s until a different number is formed, in this case changing a 4 into a 9.
To find the PP, we do the opposite and change 1’s from the original instance into 0’s, the background value, and only keep a vague outline of the original 4.
It is however often not trivial to find these noninformative feature values and domain knowledge becomes very important.
For more details, we refer the reader to the original paper.
Usage¶
Initialization¶
The optimizer is defined in TensorFlow (TF) internally. We first load our MNIST classifier and the (optional) autoencoder. The example below uses Keras or TF models. This allows optimization of the objective function to run entirely with automatic differentiation because the TF graph has access to the underlying model architecture. For models built in different frameworks (e.g. scikitlearn), the gradients of part of the loss function with respect to the input features need to be evaluated numerically. We’ll handle this case later.
# define models
cnn = load_model('mnist_cnn.h5')
ae = load_model('mnist_ae.h5')
We can now initialize the CEM explainer:
# initialize CEM explainer
shape = (1,) + x_train.shape[1:]
mode = 'PN'
cem = CEM(cnn, mode, shape, kappa=0., beta=.1,
feature_range=(x_train.min(), x_train.max()),
gamma=100, ae_model=ae, max_iterations=1000,
c_init=1., c_steps=10, learning_rate_init=1e2,
clip=(1000.,1000.), no_info_val=1.)
Besides passing the the predictive and autoencoder models, we set a number of hyperparameters …
… general:
mode
: ‘PN’ or ‘PP’.shape
: shape of the instance to be explained, starting with batch dimension. Currently only single explanations are supported, so the batch dimension should be equal to 1.feature_range
: global or featurewise min and max values for the perturbed instance.
… related to the optimizer:
max_iterations
: number of loss optimization steps for each value of c; the multiplier of the first loss term.learning_rate_init
: initial learning rate, follows polynomial decay.clip
: min and max gradient values.
… related to the noninformative value:
no_info_val
: as explained in the previous section, it is important to define which feature values are considered background and not crucial for the class predictions. For MNIST images scaled between 0 and 1 or 0.5 and 0.5 as in the notebooks, pixel perturbations in the direction of the (low) background pixel value can be seen as removing features, moving towards the noninformative value. As a result, theno_info_val
parameter is set at a low value like 1.no_info_val
can be defined globally or featurewise. For most applications, domain knowledge becomes very important here. If a representative sample of the training set is available, we can always (naively) infer ano_info_val
by taking the featurewise median or mean:
cem.fit(x_train, no_info_type='median')
… related to the objective function:
c_init
andc_steps
: the multiplier \(c\) of the first loss term is updated forc_steps
iterations, starting atc_init
. The first loss term encourages the perturbed instance to be predicted as a different class for a PN and the same class for a PP. If we find a candidate PN or PP for the current value of \(c\), we reduce the value of \(c\) for the next optimization cycle to put more emphasis on the regularization terms and improve the solution. If we cannot find a solution, \(c\) is increased to put more weight on the prediction class restrictions of the PN and PP before focusing on the regularization.kappa
: the first term in the loss function is defined by a difference between the predicted probabilities for the perturbed instance of the original class and the max of the other classes. \(\kappa \geq 0\) defines a cap for this difference, limiting its impact on the overall loss to be optimized. Similar to the original paper, we set \(\kappa\) to 0. in the examples.beta
: \(\beta\) is the \(L_{1}\) loss term multiplier. A higher value for \(\beta\) means more weight on the sparsity restrictions of the perturbations. Similar to the paper, we set \(\beta\) to 0.1 for the MNIST and Iris datasets.gamma
: multiplier for the optional \(L_{2}\) reconstruction error. A higher value for \(\gamma\) means more emphasis on the reconstruction error penalty defined by the autoencoder. Similar to the paper, we set \(\gamma\) to 100 when we have an autoencoder available.
While the paper’s default values for the loss term coefficients worked well for the simple examples provided in the notebooks, it is recommended to test their robustness for your own applications.
Explanation¶
We can finally explain the instance:
explanation = cem.explain(X)
The explain
method returns a dictionary with the following key: value pairs:
X: original instance
X_pred: predicted class of original instance
PN or PP: Pertinent Negative or Pertinant Positive
PN_pred or PP_pred: predicted class of PN or PP
grads_graph: gradient values computed from the TF graph with respect to the input features at the PN or PP
grads_num: numerical gradient values with respect to the input features at the PN or PP
Numerical Gradients¶
So far, the whole optimization problem could be defined within the internal TF graph, making autodiff possible. It is however possible that we do not have access to the model architecture and weights, and are only provided with a predict
function returning probabilities for each class. We initialize the CEM in the same way as before:
# define model
lr = load_model('iris_lr.h5')
predict_fn = lambda x: lr.predict(x)
# initialize CEM explainer
shape = (1,) + x_train.shape[1:]
mode = 'PP'
cem = CEM(predict_fn, mode, shape, kappa=0., beta=.1,
feature_range=(x_train.min(), x_train.max()),
eps=(1e2, 1e2), update_num_grad=100)
In this case, we need to evaluate the gradients of the loss function with respect to the input features numerically:
where \(L\) is the loss function, \(p\) the predict function and \(x\) the input features to optimize. There are now 2 additional hyperparameters to consider:
eps
: a tuple to define the perturbation size used to compute the numerical gradients.eps[0]
andeps[1]
are used respectively for \(^{\delta L}/_{\delta p}\) and \(^{\delta p}/_{\delta x}\).eps[0]
andeps[1]
can be a combination of float values or numpy arrays. Foreps[0]
, the array dimension should be (1 x nb of prediction categories) and foreps[1]
it should be (1 x nb of features). For the Iris dataset,eps
could look as follows:
eps0 = np.array([[1e2, 1e2, 1e2]]) # 3 prediction categories, equivalent to 1e2
eps1 = np.array([[1e2, 1e2, 1e2, 1e2]]) # 4 features, also equivalent to 1e2
eps = (eps0, eps1)
update_num_grad
: for complex models with a high number of parameters and a high dimensional feature space (e.g. Inception on ImageNet), evaluating numerical gradients can be expensive as they involve prediction calls for each perturbed instance. Theupdate_num_grad
parameter allows you to set a batch size on which to evaluate the numerical gradients, reducing the number of prediction calls required.
This page was generated from doc/source/methods/CF.ipynb.
Counterfactual Instances¶
Overview¶
A counterfactual explanation of an outcome or a situation \(Y\) takes the form “If \(X\) had not occured, \(Y\) would not have occured” (Interpretable Machine Learning). In the context of a machine learning classifier \(X\) would be an instance of interest and \(Y\) would be the label predicted by the model. The task of finding a counterfactual explanation is then to find some \(X^\prime\) that is in some way related to the original instance \(X\) but leading to a different prediction \(Y^\prime\). Reasoning in counterfactual terms is very natural for humans, e.g. asking what should have been done differently to achieve a different result. As a consequence counterfactual instances for machine learning predictions is a promising method for humaninterpretable explanations.
The counterfactual method described here is the most basic way of defining the problem of finding such \(X^\prime\). Our algorithm loosely follows Wachter et al. (2017): Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR. For an extension to the basic method which provides ways of finding higher quality counterfactual instances \(X^\prime\) in a quicker time, please refer to Counterfactuals Guided by Prototypes.
We can reason that the most basic requirements for a counterfactual \(X^\prime\) are as follows:
The predicted class of \(X^\prime\) is different from the predicted class of \(X\)
The difference between \(X\) and \(X^\prime\) should be humaninterpretable.
While the first condition is straightforward, the second condition does not immediately lend itself to a condition as we need to first define “interpretability” in a mathematical sense. For this method we restrict ourselves to a particular definition by asserting that \(X^\prime\) should be as close as possible to \(X\) without violating the first condition. There main issue with this definition of “interpretability” is that the difference between \(X^\prime\) and \(X\) required to change the model prediciton might be so small as to be uninterpretable to the human eye in which case we need a more sophisticated approach.
That being said, we can now cast the search for \(X^\prime\) as a simple optimization problem with the following loss:
where the first lost term \(L_{\text{pred}}\) guides the search towards points \(X^\prime\) which would change the model prediction and the second term \(\lambda L_{\text{dist}}\) ensures that \(X^\prime\) is close to \(X\). This form of loss has a single hyperparameter \(\lambda\) weighing the contributions of the two competing terms.
The specific loss in our implementation is as follows:
Here \(t\) is the desired target class for \(X^\prime\) which can either be specified in advance or left up to the optimization algorithm to find, \(p_t\) is the target probability of this class (typically \(p_t=1\)), \(f_t\) is the model prediction on class \(t\) and \(L_1\) is the distance between the proposed counterfactual instance \(X^\prime\) and the instance to be explained \(X\). The use of the \(L_1\) distance should ensure that the \(X^\prime\) is a sparse counterfactual  minimizing the number of features to be changed in order to change the prediction.
The optimal value of the hyperparameter \(\lambda\) will vary from dataset to dataset and even within a dataset for each instance to be explained and the desired target class. As such it is difficult to set and we learn it as part of the optimization algorithm, i.e. we want to optimize
subject to
where \(\epsilon\) is a tolerance parameter. In practice this is done in two steps, on the first pass we sweep a broad range of \(\lambda\), e.g. \(\lambda\in(10^{1},\dots,10^{10}\)) to find lower and upper bounds \(\lambda_{\text{lb}}, \lambda_{\text{ub}}\) where counterfactuals exist. Then we use bisection to find the maximum \(\lambda\in[\lambda_{\text{lb}}, \lambda_{\text{ub}}]\) such that the counterfactual constraint still holds. The result is a set of counterfactual instances \(X^\prime\) with varying distance from the test instance \(X\).
Usage¶
Initialization¶
The counterfactual (CF) explainer method works on fully blackbox models, meaning they can work with arbitrary functions that take arrays and return arrays. However, if the user has access to a full TensorFlow (TF) or Keras model, this can be passed in as well to take advantage of the automatic differentiation in TF to speed up the search. This section describes the initialization for a TF/Keras model, for fully blackbox models refer to numerical gradients.
First we load the TF/Keras model:
model = load_model('my_model.h5')
Then we can initialize the counterfactual object:
shape = (1,) + x_train.shape[1:]
cf = CounterFactual(model, shape, distance_fn='l1', target_proba=1.0,
target_class='other', max_iter=1000, early_stop=50, lam_init=1e1,
max_lam_steps=10, tol=0.05, learning_rate_init=0.1,
feature_range=(1e10, 1e10), eps=0.01, init='identity',
decay=True, write_dir=None, debug=False)
Besides passing the the model, we set a number of hyperparameters …
… general:
shape
: shape of the instance to be explained, starting with batch dimension. Currently only single explanations are supported, so the batch dimension should be equal to 1.feature_range
: global or featurewise min and max values for the perturbed instance.write_dir
: write directory for Tensorboard logging of the loss terms. It can be helpful when tuning the hyperparameters for your use case. It makes it easy to verify that e.g. not 1 loss term dominates the optimization, that the number of iterations is OK etc. You can access Tensorboard by runningtensorboard logdir {write_dir}
in the terminal.debug
: flag to enable/disable writing to Tensorboard.
… related to the optimizer:
max_iterations
: number of loss optimization steps for each value of \(\lambda\); the multiplier of the distance loss term.learning_rate_init
: initial learning rate, follows linear decay.decay
: flag to disable learning rate decay if desiredearly_stop
: early stopping criterion for the search. If no counterfactuals are found for this many steps or if this many counterfactuals are found in a row we change \(\lambda\) accordingly and continue the search.init
: how to initialize the search, currently only"identity"
is supported meaning the search starts from the original instance.
… related to the objective function:
distance_fn
: distance function between the test instance \(X\) and the proposed counterfactual \(X^\prime\), currently only"l1"
is supported.target_proba
: desired target probability for the returned counterfactual instance. Defaults to1.0
, but it could be useful to reduce it to allow a looser definition of a counterfactual instance.tol
: the tolerance within thetarget_proba
, this works in tandem withtarget_proba
to specify a range of acceptable predicted probability values for the counterfactual.target_class
: desired target class for the returned counterfactual instance. Can be either an integer denoting the specific class membership or the stringother
which will find a counterfactual instance whose predicted class is anything other than the class of the test instance.lam_init
: initial value of the hyperparameter \(\lambda\). This is set to a high value \(\lambda=1e^{1}\) and annealed during the search to find good bounds for \(\lambda\) and for most applications should be fine to leave as default.max_lam_steps
: the number of steps (outer loops) to search for with a different value of \(\lambda\).
While the default values for the loss term coefficients worked well for the simple examples provided in the notebooks, it is recommended to test their robustness for your own applications.
Fit¶
The method is purely unsupervised so no fit method is necessary.
Explanation¶
We can now explain the instance \(X\):
explanation = cf.explain(X)
The explain
method returns a dictionary with the following key: value pairs:
cf: dictionary containing the counterfactual instance found with the smallest distance to the test instance, it has the following keys:
X: the counterfactual instance
distance: distance to the original instance
lambda: value of \(\lambda\) corresponding to the counterfactual
index: the step in the search procedure when the counterfactual was found
class: predicted class of the counterfactual
proba: predicted class probabilities of the counterfactual
loss: counterfactual loss
orig_class: predicted class of original instance
orig_proba: predicted class probabilites of the original instance
all: dictionary of all instances encountered during the search that satisfy the counterfactual constraint but have higher distance to the original instance than the returned counterfactual. This is organized by levels of \(\lambda\), i.e.
explanation['all'][0]
will be a list of dictionaries corresponding to instances satisfying the counterfactual condition found in the first iteration over \(\lambda\) during bisection.
Numerical Gradients¶
So far, the whole optimization problem could be defined within the TF graph, making automatic differentiation possible. It is however possible that we do not have access to the model architecture and weights, and are only provided with a predict
function returning probabilities for each class. The counterfactual can then be initialized in the same way as before, but using a prediction function:
# define model
model = load_model('mnist_cnn.h5')
predict_fn = lambda x: cnn.predict(x)
# initialize explainer
shape = (1,) + x_train.shape[1:]
cf = CounterFactual(predict_fn, shape, distance_fn='l1', target_proba=1.0,
target_class='other', max_iter=1000, early_stop=50, lam_init=1e1,
max_lam_steps=10, tol=0.05, learning_rate_init=0.1,
feature_range=(1e10, 1e10), eps=0.01, init
In this case, we need to evaluate the gradients of the loss function with respect to the input features \(X\) numerically:
where \(L_\text{pred}\) is the predict function loss term, \(p\) the predict function and \(x\) the input features to optimize. There is now an additional hyperparameter to consider:
eps
: a float or an array of floats to define the perturbation size used to compute the numerical gradients of \(^{\delta p}/_{\delta X}\). If a single float, the same perturbation size is used for all features, if the array dimension is (1 x nb of features), then a separate perturbation value can be used for each feature. For the Iris dataset,eps
could look as follows:
eps = np.array([[1e2, 1e2, 1e2, 1e2]]) # 4 features, also equivalent to eps=1e2
Examples¶
This page was generated from doc/source/methods/CFProto.ipynb.
Counterfactuals Guided by Prototypes¶
Overview¶
This method is based on the Interpretable Counterfactual Explanations Guided by Prototypes paper which proposes a fast, model agnostic method to find interpretable counterfactual explanations for classifier predictions by using class prototypes.
Humans often think about how they can alter the outcome of a situation. What do I need to change for the bank to approve my loan? is a common example. This form of counterfactual reasoning comes natural to us and explains how to arrive at a desired outcome in an interpretable manner. Moreover, examples of counterfactual instances resulting in a different outcome can give powerful insights of what is important to the the underlying decision process. This makes it a compelling method to explain predictions of machine learning models. In the context of predictive models, a counterfactual instance describes the necessary change in input features of a test instance that alter the prediction to a predefined output (e.g. a prediction class). The counterfactual is found by iteratively perturbing the input features of the test instance during an optimization process until the desired output is achieved.
A high quality counterfactual instance \(x_{cf}\) should have the following desirable properties:
The model prediction on \(x_{cf}\) needs to be close to the predefined output.
The perturbation \(\delta\) changing the original instance \(x_{0}\) into \(x_{cf} = x_{0} + \delta\) should be sparse.
The counterfactual \(x_{cf}\) should be interpretable. This implies that \(x_{cf}\) needs to lie close to both the overall and counterfactual class specific data distribution.
The counterfactual \(x_{cf}\) needs to be found fast enough so it can be used in a real life setting.
We can obtain those properties by incorporating additional loss terms in the objective function that is optimized using gradient descent. A basic loss function for a counterfactual can look like this:
\(Loss\) = \(cL_{pred}\) + \(\beta\)\(L_{1}\) + \(L_{2}\)
The first loss term, \(cL_{pred}\), encourages the perturbed instance to predict another class than the original instance. The \(\beta\)\(L_{1}\) + \(L_{2}\) terms act as an elastic net regularizer and introduce sparsity by penalizing the size of the difference between the counterfactual and the perturbed instance. While we can obtain sparse counterfactuals using this objective function, these are often not very interpretable because the training data distribution is not taken into account, and the perturbations are not necessarily meaningful.
The Contrastive Explanation Method (CEM) uses an autoencoder which is trained to reconstruct instances of the training set. We can then add the \(L_{2}\) reconstruction error of the perturbed instance as a loss term to keep the counterfactual close to the training data distribution. The loss function becomes:
\(Loss\) = \(cL_{pred}\) + \(\beta\)\(L_{1}\) + \(L_{2}\) + \(L_{AE}\)
The \(L_{AE}\) does however not necessarily lead to interpretable solutions or speed up the counterfactual search. The lack of interpretability occurs because the overall data distribution is followed, but not the class specific one. That’s where the prototype loss term \(L_{proto}\) comes in. To define the prototype for each prediction class, we can use the encoder part of the previously mentioned autoencoder. We also need the training data or at least a representative sample. We use the model to make predictions on this data set. For each predicted class, we encode the instances belonging to that class. The class prototype is simply the average of the k closest encodings in that class to the encoding of the instance that we want to explain. When we want to generate a counterfactual, we first find the nearest prototype other than the one for the predicted class on the original instance. The \(L_{proto}\) loss term tries to minimize the \(L_{2}\) distance between the counterfactual and the nearest prototype. As a result, the perturbations are guided to the closest prototype, speeding up the counterfactual search and making the perturbations more meaningful as they move towards a typical indistribution instance. If we do not have a trained encoder available, we can build class representations using kd trees for each class. The prototype is then the k nearest instance from a kd tree other than the tree which represents the predicted class on the original instance. The loss function now looks as follows:
\(Loss\) = \(cL_{pred}\) + \(\beta\)\(L_{1}\) + \(L_{2}\) + \(L_{AE}\) + \(L_{proto}\)
The method allows us to select specific prototype classes to guide the counterfactual to. For example, in MNIST the closest prototype to a 9 could be a 4. However, we can specify that we want to move towards the 7 prototype and avoid 4.
In order to help interpretability, we can also add a trust score constraint on the proposed counterfactual. The trust score is defined as the ratio of the distance between the encoded counterfactual and the prototype of the class predicted on the original instance, and the distance between the encoded counterfactual and the prototype of the class predicted for the counterfactual instance. Intuitively, a high trust score implies that the counterfactual is far from the originally predicted class compared to the counterfactual class. For more info on trust scores, please check out the documentation.
Because of the \(L_{proto}\) term, we can actually remove the prediction loss term and still obtain an interpretable counterfactual. This is especially relevant for fully black box models. When we provide the counterfactual search method with a Keras or TensorFlow model, it is incorporated in the TensorFlow graph and evaluated using automatic differentiation. However, if we only have access to the model’s prediction function, the gradient updates are numerical and typically require a large number of prediction calls because of \(L_{pred}\). These prediction calls can slow the search down significantly and become a bottleneck. We can represent the gradient of the loss term as follows:
where \(p\) is the prediction function and \(x\) the input features to optimize. For a 28 by 28 MNIST image, the \(^{\delta p}/_{\delta x}\) term alone would require a prediction call with batch size 28x28x2 = 1568. By using the prototypes to guide the search however, we can remove the prediction loss term and only make a single prediction at the end of each gradient update to check whether the predicted class on the proposed counterfactual is different from the original class.
Categorical Variables¶
It is crucial for many machine learning applications to deal with both continuous numerical and categorical data. Explanation methods which rely on perturbations or sampling of the input features need to make sure those perturbations are meaningful and capture the underlying structure of the data. If not done properly, the perturbed or sampled instances are possibly out of distribution compared to the training data and result in misleading explanations. The perturbation or sampling process becomes tricky for categorical features. For instance random perturbations across possible categories or enforcing a ranking between categories based on frequency of occurrence in the training data do not capture this structure.
Our method first computes the pairwise distances between categories of a categorical variable based on either the model predictions (MVDM) or the context provided by the other variables in the dataset (ABDM). For MVDM, we use the difference between the conditional model prediction probabilities of each category. This method is based on the Modified Value Difference Metric (MVDM) by Cost et al (1993). ABDM stands for AssociationBased Distance Metric, a categorical distance measure introduced by Le et al (2005). ABDM infers context from the presence of other variables in the data and computes a dissimilarity measure based on the KullbackLeibler divergence. Both methods can also be combined as ABDMMVDM. We can then apply multidimensional scaling to project the pairwise distances into Euclidean space. More details will be provided in a forthcoming paper.
The different use cases are highlighted in the example notebooks linked at the bottom of the page.
Usage¶
Initialization¶
The counterfactuals guided by prototypes method works on fully blackbox models. This means that they can work with arbitrary functions that take arrays and return arrays. However, if the user has access to a full TensorFlow (TF) or Keras model, this can be passed in as well to take advantage of the automatic differentiation in TF to speed up the search. This section describes the initialization for a TF/Keras model. Please see the numerical gradients section for black box models.
We first load our MNIST classifier and the (optional) autoencoder and encoder:
cnn = load_model('mnist_cnn.h5')
ae = load_model('mnist_ae.h5')
enc = load_model('mnist_enc.h5')
We can now initialize the counterfactual:
shape = (1,) + x_train.shape[1:]
cf = CounterFactualProto(cnn, shape, kappa=0., beta=.1, gamma=100., theta=100.,
ae_model=ae, enc_model=enc, max_iterations=500,
feature_range=(.5, .5), c_init=1., c_steps=5,
learning_rate_init=1e2, clip=(1000., 1000.), write_dir='./cf')
Besides passing the predictive, and (optional) autoencoder and models, we set a number of hyperparameters …
… general:
shape
: shape of the instance to be explained, starting with batch dimension. Currently only single explanations are supported, so the batch dimension should be equal to 1.feature_range
: global or featurewise min and max values for the perturbed instance.write_dir
: write directory for Tensorboard logging of the loss terms. It can be helpful when tuning the hyperparameters for your use case. It makes it easy to verify that e.g. not 1 loss term dominates the optimization, that the number of iterations is OK etc. You can access Tensorboard by runningtensorboard logdir {write_dir}
in the terminal. The figure below for example shows the loss to be optimized over different \(c\) iterations. It is clear that within each iteration, the number ofmax_iterations
steps is too high and we can speed up the search.
… related to the optimizer:
max_iterations
: number of loss optimization steps for each value of c; the multiplier of the first loss term.learning_rate_init
: initial learning rate, follows polynomial decay.clip
: min and max gradient values.
… related to the objective function:
c_init
andc_steps
: the multiplier \(c\) of the first loss term is updated forc_steps
iterations, starting atc_init
. The first loss term encourages the perturbed instance to be predicted as a different class than the original instance. If we find a candidate counterfactual for the current value of \(c\), we reduce the value of \(c\) for the next optimization cycle to put more emphasis on the other loss terms and improve the solution. If we cannot find a solution, \(c\) is increased to put more weight on the prediction class restrictions of the counterfactual.kappa
: the first term in the loss function is defined by a difference between the predicted probabilities for the perturbed instance of the original class and the max of the other classes. \(\kappa \geq 0\) defines a cap for this difference, limiting its impact on the overall loss to be optimized. Similar to CEM, we set \(\kappa\) to 0 in the examples.beta
: \(\beta\) is the \(L_{1}\) loss term multiplier. A higher value for \(\beta\) means more weight on the sparsity restrictions of the perturbations. \(\beta\) equal to 0.1 works well for the example datasets.gamma
: multiplier for the optional \(L_{2}\) reconstruction error. A higher value for \(\gamma\) means more emphasis on the reconstruction error penalty defined by the autoencoder. A value of 100 is reasonable for the examples.theta
: multiplier for the \(L_{proto}\) loss term. A higher \(\theta\) means more emphasis on the gradients guiding the counterfactual towards the nearest class prototype. A value of 100 worked well for the examples.
When the dataset contains categorical variables, we need to additionally pass the following arguments:
cat_vars
: if the categorical variables have ordinal encodings, this is a dictionary with as keys the categorical columns and values the number of categories for the categorical variable in the dataset. If onehot encoding is applied to the data, then the keys of thecat_vars
dictionary represent the column where each categorical variable starts while the values still return the number of categories.ohe
: a flag (True or False) whether the categories are onehot encoded.
It is also important to remember that the perturbations are applied in the numerical feature space, after the categorical variables have been transformed into numerical features. This has to be reflected by the dimension and values of feature_range
. Imagine for example that we have a dataset with 10 columns. Two of the features are categorical and onehot encoded. They can both take 3 values each. As a result, the number of columns in the dataset is reduced to 6 when we transform those
categorical features to numerical features. As a result, the feature_range
needs to contain the upper and lower ranges for 6 features.
While the default values for the loss term coefficients worked well for the simple examples provided in the notebooks, it is recommended to test their robustness for your own applications.
Fit¶
If we use an encoder to find the class prototypes, we need an additional fit
step on the training data:
cf.fit(x_train)
We also need the fit
step if the data contains categorical features so we can compute the numerical transformations. In practice, most of these optional arguments do not need to be changed.
cf.fit(x_train, d_type='abdm', w=None, disc_perc=[25, 50, 75], standardize_cat_vars=False,
smooth=1., center=True, update_feature_range=True)
d_type
: the distance metric used to compute the pairwise distances between the categories of each categorical variable. As discussed in the introduction, the options are"abdm"
,"mvdm"
or"abdmmvdm"
.w
: if the combined metric"abdmmvdm"
is used,w
is the weight (between 0 and 1) given toabdm
.disc_perc
: forabdm
, we infer context from the other features. If there are continuous numerical features present, these are binned according to the quartiles indisc_perc
before computing the similarity metric.standardize_car_vars
: whether to return the standardized values for the numerical distances of each categorical feature.smooth
: if the difference in the distances between the categorical variables is too large, then a lower value of thesmooth
argument (0, 1) can smoothen out this difference. This would only be relevant if one categorical variable has significantly larger differences between its categories than others. As a result, the counterfactual search process will likely leave that categorical variable unchanged.center
: whether to center the numerical distances of the categorical variables between the min and max feature ranges.update_feature_range
: whether to update thefeature_range
parameter for the categorical variables based on the min and max values it computed in thefit
step.
Explanation¶
We can now explain the instance:
explanation = cf.explain(X, Y=None, target_class=None, k=20, k_type='mean',
threshold=0., verbose=True, print_every=100, log_every=100)
X
: original instanceY
: onehotencoding of class label forX
, inferred from the prediction onX
if None.target_class
: classes considered for the nearest class prototype. Either a list with class indices or None.k
: number of nearest instances used to define the prototype for a class. Defaults to using all instances belonging to the class.k_type
: use either the average encoding of thek
nearest instances in a class as the class prototype (k_type
=’mean’) or the knearest encoding in the class (k_type
=’point’). This parameter is only relevant if an encoder is used to define the prototypes.threshold
: threshold level for the ratio between the distance of the counterfactual to the prototype of the predicted class for the original instance over the distance to the prototype of the predicted class for the counterfactual. If the trust score is below the threshold, the proposed counterfactual does not meet the requirements and is rejected.verbose
: if True, print progress of counterfactual search everyprint_every
steps.log_every
: ifwrite_dir
for Tensorboard is specified, then log losses everylog_every
steps.
The explain
method returns a dictionary with the following key: value pairs:
cf: a dictionary with the overall best counterfactual found. explanation[‘cf’] has the following key: value pairs:
X: the counterfactual instance
class: predicted class for the counterfactual
proba: predicted class probabilities for the counterfactual
grads_graph: gradient values computed from the TF graph with respect to the input features at the counterfactual
grads_num: numerical gradient values with respect to the input features at the counterfactual
orig_class: predicted class for original instance
orig_proba: predicted class probabilities for original instance
all: a dictionary with the iterations as keys and for each iteration a list with counterfactuals found in that iteration as values. So for instance, during the first iteration, explanation[‘all’][0], initially we typically find fairly noisy counterfactuals that improve over the course of the iteration. The counterfactuals for the subsequent iterations then need to be better (sparser) than the previous best counterfactual. So over the next few iterations, we probably find less but better solutions.
Numerical Gradients¶
So far, the whole optimization problem could be defined within the TF graph, making automatic differentiation possible. It is however possible that we do not have access to the model architecture and weights, and are only provided with a predict
function returning probabilities for each class. The counterfactual can then be initialized in the same way:
# define model
cnn = load_model('mnist_cnn.h5')
predict_fn = lambda x: cnn.predict(x)
ae = load_model('mnist_ae.h5')
enc = load_model('mnist_enc.h5')
# initialize explainer
shape = (1,) + x_train.shape[1:]
cf = CounterFactualProto(predict_fn, shape, gamma=100., theta=100.,
ae_model=ae, enc_model=enc, max_iterations=500,
feature_range=(.5, .5), c_init=1., c_steps=4,
eps=(1e2, 1e2), update_num_grad=100)
In this case, we need to evaluate the gradients of the loss function with respect to the input features numerically:
where \(L_{pred}\) is the loss term related to the prediction function, \(p\) is the prediction function and \(x\) are the input features to optimize. There are now 2 additional hyperparameters to consider:
eps
: a tuple to define the perturbation size used to compute the numerical gradients.eps[0]
andeps[1]
are used respectively for \(^{\delta L_{pred}}/_{\delta p}\) and \(^{\delta p}/_{\delta x}\).eps[0]
andeps[1]
can be a combination of float values or numpy arrays. Foreps[0]
, the array dimension should be (1 x nb of prediction categories) and foreps[1]
it should be (1 x nb of features). For the Iris dataset,eps
could look as follows:
eps0 = np.array([[1e2, 1e2, 1e2]]) # 3 prediction categories, equivalent to 1e2
eps1 = np.array([[1e2, 1e2, 1e2, 1e2]]) # 4 features, also equivalent to 1e2
eps = (eps0, eps1)
update_num_grad
: for complex models with a high number of parameters and a high dimensional feature space (e.g. Inception on ImageNet), evaluating numerical gradients can be expensive as they involve prediction calls for each perturbed instance. Theupdate_num_grad
parameter allows you to set a batch size on which to evaluate the numerical gradients, reducing the number of prediction calls required.
We can also remove the prediction loss term by setting c_init
to 0 and only run 1 c_steps
, and still obtain an interpretable counterfactual. This dramatically speeds up the counterfactual search (e.g. by 100x in the MNIST example notebook):
cf = CounterFactualProto(predict_fn, shape, gamma=100., theta=100.,
ae_model=ae, enc_model=enc, max_iterations=500,
feature_range=(.5, .5), c_init=0., c_steps=1)
kd trees¶
So far, we assumed that we have a trained encoder available to find the nearest class prototype. This is however not a hard requirement. As mentioned in the Overview section, we can use kd trees to build class representations, find prototypes by querying the trees for each class and return the k nearest class instance as the closest prototype. We can run the counterfactual as follows:
cf = CounterFactualProto(cnn, shape, use_kdtree=True, theta=10., feature_range=(.5, .5))
cf.fit(x_train, trustscore_kwargs=None)
explanation = cf.explain(X, k=2)
trustscore_kwargs
: keyword arguments for the trust score object used to define the kd trees for each class. Please check the trust scores documentation for more info.
This page was generated from doc/source/methods/TrustScores.ipynb.
Trust Scores¶
Overview¶
It is important to know when a machine learning classifier’s predictions can be trusted. Relying on the classifier’s (uncalibrated) prediction probabilities is not optimal and can be improved upon. Enter trust scores. Trust scores measure the agreement between the classifier and a modified nearest neighbor classifier on the predicted instances. The trust score is the ratio between the distance of the instance to the nearest class different from the predicted class and the distance to the predicted class. A score of 1 would mean that the distance to the predicted class is the same as to the nearest other class. Higher scores correspond to more trustworthy predictions. The original paper on which the algorithm is based is called To Trust Or Not To Trust A Classifier. Our implementation borrows heavily from and extends the authors’ open source code.
The method requires labeled training data to build kd trees for each prediction class. When the classifier makes predictions on a test instance, we measure the distance of the instance to each of the trees. The trust score is then calculated by taking the ratio of the smallest distance to any other class than the predicted class and the distance to the predicted class. The distance is measured to the \(k\)th nearest neighbor in each tree or by using the average distance from the first to the \(k\)th neighbor.
In order to filter out the impact of outliers in the training data, they can optionally be removed using 2 filtering techniques. The first technique builds a kd tree for each class and removes a fraction \(\alpha\) of the training instances with the largest k nearest neighbor (kNN) distance to the other instances in the class. The second fits a kNNclassifier to the training set, and removes a fraction \(\alpha\) of the training instances with the highest prediction class disagreement. Be aware that the first method operates on the prediction class level while the second method runs on the whole training set. It is also important to keep in mind that kNN methods might not be suitable when there are significant scale differences between the input features.
Trust scores can for instance be used as a warning flag for machine learning predictions. If the score drops below a certain value and there is disagreement between the model probabilities and the trust score, the prediction can be explained using techniques like anchors or contrastive explanations.
Trust scores work best for low to medium dimensional feature spaces. When working with high dimensional observations like images, dimensionality reduction methods (e.g. autoencoders or PCA) could be applied as a preprocessing step before computing the scores. This is demonstrated by the following example notebook.
Usage¶
Initialization and fit¶
At initialization, the optional filtering method used to remove outliers during the fit
stage needs to be specified as well:
from alibi.confidence import TrustScore
ts = TrustScore(alpha=.05,
filter_type='distance_knn',
k_filter=10,
leaf_size=40,
metric='euclidean',
dist_filter_type='point')
All the hyperparameters are optional:
alpha
: target fraction of instances to filter out.filter_type
: filter method; one of None (no filtering), distance_knn (first technique discussed in Overview) or probability_knn (second technique).k_filter
: number of neighbors used for the distance or probability based filtering method.leaf_size
: affects the speed and memory usage to build the kd trees. The memory scales with the ratio between the number of samples and the leaf size.metric
: distance metric used for the kd trees. Euclidean by default.dist_filter_type
: point uses the distance to the \(k\)nearest point while mean uses the average distance from the 1st to the \(k\)th nearest point during filtering.
In this example, we use the distance_knn method to filter out 5% of the instances of each class with the largest distance to its 10th nearest neighbor in that class:
ts.fit(X_train, y_train, classes=3)
classes
: equals the number of prediction classes.
X_train is the training set and y_train represents the training labels, either using onehot encoding (OHE) or simple class labels.
Scores¶
The trust scores are simply calculated through the score
method. score
also returns the class labels of the closest not predicted class as a numpy array:
score, closest_class = ts.score(X_test,
y_pred,
k=2,
dist_type='point')
y_pred can again be represented using both OHE or via class labels.
k
: \(k\)th nearest neighbor used to compute distance to for each class.dist_type
: similar to the filtering step, we can compute the distance to each class either to the \(k\)th nearest point (point) or by using the average distance from the 1st to the \(k\)th nearest point (mean).
This page was generated from examples/anchor_tabular_adult.ipynb.
Anchor explanations for income prediction¶
In this example, we will explain predictions of a Random Forest classifier whether a person will make more or less than $50k based on characteristics like age, marital status, gender or occupation. The features are a mixture of ordinal and categorical data and will be preprocessed accordingly.
[1]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from alibi.explainers import AnchorTabular
from alibi.datasets import fetch_adult
Load adult dataset¶
The fetch_adult
function returns a Bunch
object containing the features, the targets, the feature names and a mapping of categorical variables to numbers which are required for formatting the output of the Anchor explainer.
[2]:
adult = fetch_adult()
adult.keys()
[2]:
dict_keys(['data', 'target', 'feature_names', 'target_names', 'category_map'])
[3]:
data = adult.data
target = adult.target
feature_names = adult.feature_names
category_map = adult.category_map
Note that for your own datasets you can use our utility function gen_category_map to create the category map:
[4]:
from alibi.utils.data import gen_category_map
Define shuffled training and test set
[5]:
np.random.seed(0)
data_perm = np.random.permutation(np.c_[data, target])
data = data_perm[:,:1]
target = data_perm[:,1]
[6]:
idx = 30000
X_train,Y_train = data[:idx,:], target[:idx]
X_test, Y_test = data[idx+1:,:], target[idx+1:]
Create feature transformation pipeline¶
Create feature preprocessor. Needs to have ‘fit’ and ‘transform’ methods. Different types of preprocessing can be applied to all or part of the features. In the example below we will standardize ordinal features and apply onehotencoding to categorical features.
Ordinal features:
[7]:
ordinal_features = [x for x in range(len(feature_names)) if x not in list(category_map.keys())]
ordinal_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
Categorical features:
[8]:
categorical_features = list(category_map.keys())
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
Combine and fit:
[9]:
preprocessor = ColumnTransformer(transformers=[('num', ordinal_transformer, ordinal_features),
('cat', categorical_transformer, categorical_features)])
preprocessor.fit(X_train)
[9]:
ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True))],
verbose=False),
[0, 8, 9...
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('onehot',
OneHotEncoder(categorical_features=None,
categories=None,
drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='ignore',
n_values=None,
sparse=True))],
verbose=False),
[1, 2, 3, 4, 5, 6, 7, 11])],
verbose=False)
Train Random Forest model¶
Fit on preprocessed (imputing, OHE, standardizing) data.
[10]:
np.random.seed(0)
clf = RandomForestClassifier(n_estimators=50)
clf.fit(preprocessor.transform(X_train), Y_train)
[10]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
Define predict function
[11]:
predict_fn = lambda x: clf.predict(preprocessor.transform(x))
print('Train accuracy: ', accuracy_score(Y_train, predict_fn(X_train)))
print('Test accuracy: ', accuracy_score(Y_test, predict_fn(X_test)))
Train accuracy: 0.9655333333333334
Test accuracy: 0.855859375
Initialize and fit anchor explainer for tabular data¶
[12]:
explainer = AnchorTabular(predict_fn, feature_names, categorical_names=category_map)
Discretize the ordinal features into quartiles
[13]:
explainer.fit(X_train, disc_perc=[25, 50, 75])
Getting an anchor¶
Below, we get an anchor for the prediction of the first observation in the test set. An anchor is a sufficient condition  that is, when the anchor holds, the prediction should be the same as the prediction for this instance.
[14]:
idx = 0
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, 1))[0]])
Prediction: <=50K
We set the precision threshold to 0.95. This means that predictions on observations where the anchor holds will be the same as the prediction on the explained instance at least 95% of the time.
[15]:
explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])
Anchor: Marital Status = Separated AND Sex = Female
Precision: 0.96
Coverage: 0.11
…or not?¶
Let’s try getting an anchor for a different observation in the test set  one for the which the prediction is >50K
.
[16]:
idx = 6
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, 1))[0]])
explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])
Prediction: >50K
Could not find an anchor satisfying the 0.95 precision constraint. Now returning the best noneligible anchor.
Anchor: Capital Loss > 0.00 AND Marital Status = Married AND Relationship = Husband AND Age > 47.00 AND Race = White AND Sex = Male AND Country = UnitedStates AND Workclass = Selfempnotinc AND Capital Gain <= 0.00
Precision: 0.64
Coverage: 0.00
Notice how no anchor is found!
This is due to the imbalanced dataset (roughly 25:75 high:low earner proportion), so during the sampling stage feature ranges corresponding to lowearners will be oversampled. This is a feature because it can point out an imbalanced dataset, but it can also be fixed by producing balanced datasets to enable anchors to be found for either class.
This page was generated from examples/anchor_tabular_iris.ipynb.
Anchor explanations on the Iris dataset¶
[1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from alibi.explainers import AnchorTabular
Load iris dataset¶
[2]:
dataset = load_iris()
feature_names = dataset.feature_names
class_names = list(dataset.target_names)
Define training and test set
[3]:
idx = 145
X_train,Y_train = dataset.data[:idx,:], dataset.target[:idx]
X_test, Y_test = dataset.data[idx+1:,:], dataset.target[idx+1:]
Train Random Forest model¶
[4]:
np.random.seed(0)
clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_train, Y_train)
[4]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
Define predict function
[5]:
predict_fn = lambda x: clf.predict_proba(x)
Initialize and fit anchor explainer for tabular data¶
[6]:
explainer = AnchorTabular(predict_fn, feature_names)
Discretize the ordinal features into quartiles
[7]:
explainer.fit(X_train, disc_perc=[25, 50, 75])
Getting an anchor¶
Below, we get an anchor for the prediction of the first observation in the test set. An anchor is a sufficient condition  that is, when the anchor holds, the prediction should be the same as the prediction for this instance.
[8]:
idx = 0
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, 1))[0]])
Prediction: virginica
We set the precision threshold to 0.95. This means that predictions on observations where the anchor holds will be the same as the prediction on the explained instance at least 95% of the time.
[9]:
explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])
Anchor: petal width (cm) > 1.80 AND sepal width (cm) <= 2.80
Precision: 0.99
Coverage: 0.07
This page was generated from examples/anchor_text_movie.ipynb.
Anchor explanations for movie sentiment¶
In this example, we will explain why a certain sentence is classified by a logistic regression as having negative or positive sentiment. The logistic regression is trained on negative and positive movie reviews.
[2]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import spacy
from alibi.explainers import AnchorText
from alibi.datasets import fetch_movie_sentiment
from alibi.utils.download import spacy_model
Load movie review dataset¶
The fetch_movie_sentiment
function returns a Bunch
object containing the features, the targets and the target names for the dataset.
[3]:
movies = fetch_movie_sentiment()
movies.keys()
[3]:
dict_keys(['data', 'target', 'target_names'])
[4]:
data = movies.data
labels = movies.target
target_names = movies.target_names
Define shuffled training, validation and test set
[5]:
train, test, train_labels, test_labels = train_test_split(data, labels, test_size=.2, random_state=42)
train, val, train_labels, val_labels = train_test_split(train, train_labels, test_size=.1, random_state=42)
train_labels = np.array(train_labels)
test_labels = np.array(test_labels)
val_labels = np.array(val_labels)
Apply CountVectorizer to training set¶
[6]:
vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(train)
[6]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
Fit model¶
[7]:
np.random.seed(0)
clf = LogisticRegression(solver='liblinear')
clf.fit(vectorizer.transform(train), train_labels)
[7]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)
Define prediction function¶
[8]:
predict_fn = lambda x: clf.predict(vectorizer.transform(x))
Make predictions on train and test sets¶
[9]:
preds_train = predict_fn(train)
preds_val = predict_fn(val)
preds_test = predict_fn(test)
print('Train accuracy', accuracy_score(train_labels, preds_train))
print('Validation accuracy', accuracy_score(val_labels, preds_val))
print('Test accuracy', accuracy_score(test_labels, preds_test))
Train accuracy 0.9801624284382905
Validation accuracy 0.7544910179640718
Test accuracy 0.7589841878294202
Load spaCy model¶
English multitask CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, contextspecific token vectors, POS tags, dependency parse and named entities.
[10]:
model = 'en_core_web_md'
spacy_model(model=model)
nlp = spacy.load(model)
Initialize anchor text explainer¶
[11]:
explainer = AnchorText(nlp, predict_fn)
Explain a prediction¶
[12]:
class_names = movies.target_names
[13]:
text = data[4]
print(text)
a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification .
Prediction:
[14]:
pred = class_names[predict_fn([text])[0]]
alternative = class_names[1  predict_fn([text])[0]]
print('Prediction: %s' % pred)
Prediction: negative
Explanation:
[15]:
np.random.seed(0)
explanation = explainer.explain(text, threshold=0.95, use_unk=True)
use_unk=True means we will perturb examples by replacing words with UNKs. Let us now take a look at the anchor. The word ‘exercise’ basically guarantees a negative prediction.
[16]:
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('\nExamples where anchor applies and model predicts %s:' % pred)
print('\n'.join([x[0] for x in explanation['raw']['examples'][1]['covered_true']]))
print('\nExamples where anchor applies and model predicts %s:' % alternative)
print('\n'.join([x[0] for x in explanation['raw']['examples'][1]['covered_false']]))
Anchor: exercise
Precision: 0.99
Examples where anchor applies and model predicts negative:
a visually UNK UNK UNK opaque and emotionally vapid exercise UNK UNK and mystification UNK
UNK visually UNK UNK narratively opaque UNK emotionally UNK exercise UNK style UNK mystification .
UNK UNK UNK but UNK UNK and emotionally UNK exercise UNK UNK and UNK UNK
UNK UNK flashy but narratively UNK UNK emotionally vapid exercise UNK style and mystification .
UNK UNK UNK UNK narratively opaque and UNK UNK exercise UNK UNK and UNK .
a UNK flashy UNK UNK UNK and emotionally vapid exercise in UNK and mystification UNK
a visually UNK but UNK opaque UNK UNK vapid exercise UNK style and mystification UNK
a visually UNK but UNK opaque UNK UNK vapid exercise UNK UNK and mystification UNK
a UNK flashy but UNK UNK UNK emotionally UNK exercise in UNK and mystification .
a UNK UNK but UNK opaque UNK UNK UNK exercise UNK UNK UNK mystification .
Examples where anchor applies and model predicts positive:
UNK visually UNK UNK narratively UNK and UNK UNK exercise UNK style and UNK UNK
Changing the perturbation distribution¶
Let’s try this with another perturbation distribution, namely one that replaces words by similar words instead of UNKs.
Explanation:
[17]:
np.random.seed(0)
explanation = explainer.explain(text, threshold=0.95, use_unk=False, sample_proba=0.5)
The anchor now shows that we need more to guarantee the negative prediction:
[18]:
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('\nExamples where anchor applies and model predicts %s:' % pred)
print('\n'.join([x[0] for x in explanation['raw']['examples'][1]['covered_true']]))
print('\nExamples where anchor applies and model predicts %s:' % alternative)
print('\n'.join([x[0] for x in explanation['raw']['examples'][1]['covered_false']]))
Anchor: flashy AND emotionally
Precision: 0.96
Examples where anchor applies and model predicts negative:
a visually flashy but psychologically preferable and emotionally unintelligible action behind style and laziness .
the anatomically flashy but narratively renderer and emotionally silly exercise in type and madness .
a properly flashy but narratively greenish and emotionally ludicrous blueprint although style and anathema .
a impressively flashy but extraordinarily bright and emotionally unfunny exercise in accent and anathema .
a visually flashy but supremely opaque and emotionally vapid exercise except bossa and empiricism .
a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification .
a visually flashy but similarly readable and emotionally vapid exercise beyond practicality and bravado .
both incredibly flashy but narratively greenish and emotionally vapid meditation than style and falsehood .
every visually flashy but gloriously opaque and emotionally vapid exercise in custom and mystification .
a visually flashy but narratively opaque and emotionally vapid exercise of custom and negation .
Examples where anchor applies and model predicts positive:
each visually flashy but masterfully scalable and emotionally snooty exercise into style and nastiness .
both vividly flashy but musically delicacy and emotionally vapid hypertrophy for style and misfortune .
both beautifully flashy but extremely voxel and emotionally unoriginal strenght in choice and immaturity .
a visually flashy but narratively opaque and emotionally pushy vitality to choice and fear .
a easily flashy but unnaturally readable and emotionally vapid gym in beauty and rationality .
each spectacularly flashy but narratively solid and emotionally vapid exercise in character and ego .
each visually flashy but narratively solid and emotionally inoffensive gymnast about style and mystification .
each technologically flashy but narratively bright and emotionally vapid vitality in style and arrogance .
a vividly flashy but narratively delicate and emotionally vapid risk from style and ego .
a visually flashy but delightfully opaque and emotionally vapid crossfit among style and xenophobia .
We can make the token perturbation distribution sample words that are more similar to the ground truth word via the top_n
argument. Smaller values (default=100) should result in sentences that are more coherent and thus more in the distribution of natural language which could influence the returned anchor. By setting the use_probability_proba
to True, the sampling distribution for perturbed tokens is proportional to the similarity score between the possible perturbations and the original
word. We can also put more weight on similar words via the temperature
argument. Lower values of temperature
increase the sampling weight of more similar words. The following example will perturb tokens in the original sentence with probability equal to sample_proba
. The sampling distribution for the perturbed tokens is proportional to the similarity score between the ground truth word and each of the top_n
words.
[19]:
np.random.seed(0)
explanation = explainer.explain(text, threshold=0.95, use_similarity_proba=True, sample_proba=0.5,
use_unk=False, top_n=20, temperature=.2)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('\nExamples where anchor applies and model predicts %s:' % pred)
print('\n'.join([x[0] for x in explanation['raw']['examples'][1]['covered_true']]))
print('\nExamples where anchor applies and model predicts %s:' % alternative)
print('\n'.join([x[0] for x in explanation['raw']['examples'][1]['covered_false']]))
Anchor: flashy
Precision: 0.99
Examples where anchor applies and model predicts negative:
this visually flashy but stylistically thin and emotionally vapid exercise into fashion and delusion .
another visually flashy but narratively opaque and intellectually vapid exercise into style and mystification .
another visually flashy but artistically translucent and mentally vapid exercise in style and mystification .
a stylistically flashy but artistically opaque and psychologically vapid exercise in style and selfishness .
a visually flashy but narratively transparent and emotionally vapid exercise to style and mystification .
another functionally flashy but thematically opaque and emotionally vapid exercise in style and delusion .
a artistically flashy but narratively resilient and emotionally classless exercise in retro and delusion .
a concurrently flashy but narratively translucent and emotionally vapid exercise in style and mystification .
a visually flashy but narratively translucent and emotionally vapid exercise across charm and mystification .
a stylistically flashy but aesthetically transparent and emotionally classless exercise near simplicity and mystification .
Examples where anchor applies and model predicts positive:
each visually flashy but narratively opaque and psychologically asinine diet while style and mystification .
any visually flashy but aesthetically transparent and spiritually vapid excercise throughout sensibility and ignorance .
a visually flashy but spiritually opaque and psychologically vacuous excercise into style and selfishness .
This page was generated from examples/anchor_image_imagenet.ipynb.
Anchor explanations for ImageNet¶
[1]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input, decode_predictions
from alibi.datasets import fetch_imagenet
from alibi.explainers import AnchorImage
Load InceptionV3 model pretrained on ImageNet¶
[2]:
model = InceptionV3(weights='imagenet')
WARNING:tensorflow:From /home/janis/.conda/envs/py37/lib/python3.7/sitepackages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Download and preprocess some images from ImageNet¶
The fetch_imagenet function takes as arguments one of the following ImageNet categories: ‘Persian cat’, ‘volcano’, ‘strawberry’, ‘jellyfish’ or ‘centipede’ as well as the number of images to return and the target size of the image.
[3]:
category = 'Persian cat'
image_shape = (299, 299, 3)
data, labels = fetch_imagenet(category, nb_images=10, target_size=image_shape[:2], seed=2, return_X_y=True)
print('Images shape: {}'.format(data.shape))
Images shape: (10, 299, 299, 3)
Apply image preprocessing, make predictions and map predictions back to categories. The output label is a tuple which consists of the class name, description and the prediction probability.
[4]:
images = preprocess_input(data)
preds = model.predict(images)
label = decode_predictions(preds, top=3)
print(label[0])
[('n02123394', 'Persian_cat', 0.909348), ('n03207941', 'dishwasher', 0.0027692039), ('n03832673', 'notebook', 0.0020055303)]
Define prediction function¶
[5]:
predict_fn = lambda x: model.predict(x)
Initialize anchor image explainer¶
The segmentation function will be used to generate superpixels. It is important to have meaningful superpixels in order to generate a useful explanation. Please check scikitimage’s segmentation methods (felzenszwalb, slic and quickshift built in the explainer) for more information.
In the example, the pixels not in the proposed anchor will take the average value of their superpixel. Another option is to superimpose the pixel values from other images which can be passed as a numpy array to the images_background argument.
[6]:
segmentation_fn = 'slic'
kwargs = {'n_segments': 15, 'compactness': 20, 'sigma': .5}
explainer = AnchorImage(predict_fn, image_shape, segmentation_fn=segmentation_fn,
segmentation_kwargs=kwargs, images_background=None)
Explain a prediction¶
The explanation of the below image returns a mask with the superpixels that constitute the anchor.
[7]:
i = 0
plt.imshow(data[i]);
The threshold, p_sample and tau parameters are also key to generate a sensible explanation and ensure fast enough convergence. The threshold defines the minimum fraction of samples for a candidate anchor that need to lead to the same prediction as the original instance. While a higher threshold gives more confidence in the anchor, it also leads to longer computation time. p_sample determines the fraction of superpixels that are changed to either the average value of the superpixel or the pixel value for the superimposed image. The pixels in the proposed anchors are of course unchanged. The parameter tau determines when we assume convergence. A bigger value for tau means faster convergence but also looser anchor restrictions.
[8]:
image = images[i]
np.random.seed(0)
explanation = explainer.explain(image, threshold=.95, p_sample=.5, tau=0.25)
Superpixels in the anchor:
[9]:
plt.imshow(explanation['anchor']);
A visualization of all the superpixels:
[10]:
plt.imshow(explanation['segments']);
This page was generated from examples/anchor_image_fashion_mnist.ipynb.
Anchor explanations for fashion MNIST¶
[1]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Input
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
from alibi.explainers import AnchorImage
Load and prepare fashion MNIST data¶
[2]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
x_train shape: (60000, 28, 28) y_train shape: (60000,)
[3]:
idx = 0
plt.imshow(x_train[idx]);
Scale, reshape and categorize data
[4]:
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = np.reshape(x_train, x_train.shape + (1,))
x_test = np.reshape(x_test, x_test.shape + (1,))
print('x_train shape:', x_train.shape, 'x_test shape:', x_test.shape)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print('y_train shape:', y_train.shape, 'y_test shape:', y_test.shape)
x_train shape: (60000, 28, 28, 1) x_test shape: (10000, 28, 28, 1)
y_train shape: (60000, 10) y_test shape: (10000, 10)
Define CNN model¶
[5]:
def model():
x_in = Input(shape=(28, 28, 1))
x = Conv2D(filters=64, kernel_size=2, padding='same', activation='relu')(x_in)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Conv2D(filters=32, kernel_size=2, padding='same', activation='relu')(x)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x_out = Dense(10, activation='softmax')(x)
cnn = Model(inputs=x_in, outputs=x_out)
cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return cnn
[6]:
cnn = model()
cnn.summary()
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
conv2d (Conv2D) (None, 28, 28, 64) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 14, 14, 64) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 14, 14, 32) 8224
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 7, 7, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 1568) 0
_________________________________________________________________
dense (Dense) (None, 256) 401664
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 2570
=================================================================
Total params: 412,778
Trainable params: 412,778
Nontrainable params: 0
_________________________________________________________________
Train model¶
[7]:
cnn.fit(x_train, y_train, batch_size=64, epochs=3)
Epoch 1/3
60000/60000 [==============================]  38s 630us/sample  loss: 0.5847  acc: 0.7862
Epoch 2/3
60000/60000 [==============================]  39s 643us/sample  loss: 0.4134  acc: 0.8507
Epoch 3/3
60000/60000 [==============================]  39s 644us/sample  loss: 0.3660  acc: 0.8671
[7]:
<tensorflow.python.keras.callbacks.History at 0x7ffb62bdafd0>
[8]:
# Evaluate the model on test set
score = cnn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.8839
Define superpixels¶
Function to generate rectangular superpixels for a given image. Alternatively, use one of the built in methods. It is important to have meaningful superpixels in order to generate a useful explanation. Please check scikitimage’s segmentation methods (felzenszwalb, slic and quickshift built in the explainer) for more information on the built in methods.
[9]:
def superpixel(image, size=(4, 7)):
segments = np.zeros([image.shape[0], image.shape[1]])
row_idx, col_idx = np.where(segments == 0)
for i, j in zip(row_idx, col_idx):
segments[i, j] = int((image.shape[1]/size[1]) * (i//size[0]) + j//size[1])
return segments
[10]:
segments = superpixel(x_train[idx])
plt.imshow(segments);
Define prediction function¶
[11]:
predict_fn = lambda x: cnn.predict(x)
Initialize anchor image explainer¶
[12]:
image_shape = x_train[idx].shape
explainer = AnchorImage(predict_fn, image_shape, segmentation_fn=superpixel)
Explain a prediction¶
The explanation returns a mask with the superpixels that constitute the anchor.
Image to be explained:
[13]:
i = 0
image = x_test[i]
plt.imshow(image[:,:,0]);
Model prediction:
[14]:
cnn.predict(image.reshape(1, 28, 28, 1)).argmax()
[14]:
9
The predicted category correctly corresponds to the class ankle boot
:
Label 
Description 

0 
Tshirt/top 
1 
Trouser 
2 
Pullover 
3 
Dress 
4 
Coat 
5 
Sandal 
6 
Shirt 
7 
Sneaker 
8 
Bag 
9 
Ankle boot 
Generate explanation:
[15]:
explanation = explainer.explain(image, threshold=.95, p_sample=.8)
Show anchor:
[16]:
plt.imshow(explanation['anchor'][:,:,0]);
From the example, it looks like the arch of the boot alone is sufficient to predict an ankle boot.
This page was generated from examples/cem_mnist.ipynb.
Contrastive Explanations Method (CEM) applied to MNIST¶
The Contrastive Explanation Method (CEM) can generate black box model explanations in terms of pertinent positives (PP) and pertinent negatives (PN). For PP, it finds what should be minimally and sufficiently present (e.g. important pixels in an image) to justify its classification. PN on the other hand identify what should be minimally and necessarily absent from the explained instance in order to maintain the original prediction.
The original paper where the algorithm is based on can be found on arXiv.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
import tensorflow.keras as keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Input, UpSampling2D
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
from alibi.explainers import CEM
Load and prepare MNIST data¶
[2]:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
plt.gray()
plt.imshow(x_test[4]);
x_train shape: (60000, 28, 28) y_train shape: (60000,)
Prepare data: scale, reshape and categorize
[3]:
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = np.reshape(x_train, x_train.shape + (1,))
x_test = np.reshape(x_test, x_test.shape + (1,))
print('x_train shape:', x_train.shape, 'x_test shape:', x_test.shape)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print('y_train shape:', y_train.shape, 'y_test shape:', y_test.shape)
x_train shape: (60000, 28, 28, 1) x_test shape: (10000, 28, 28, 1)
y_train shape: (60000, 10) y_test shape: (10000, 10)
[4]:
xmin, xmax = .5, .5
x_train = ((x_train  x_train.min()) / (x_train.max()  x_train.min())) * (xmax  xmin) + xmin
x_test = ((x_test  x_test.min()) / (x_test.max()  x_test.min())) * (xmax  xmin) + xmin
Define and train CNN model¶
[5]:
def cnn_model():
x_in = Input(shape=(28, 28, 1))
x = Conv2D(filters=64, kernel_size=2, padding='same', activation='relu')(x_in)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Conv2D(filters=32, kernel_size=2, padding='same', activation='relu')(x)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Conv2D(filters=32, kernel_size=2, padding='same', activation='relu')(x)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x_out = Dense(10, activation='softmax')(x)
cnn = Model(inputs=x_in, outputs=x_out)
cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return cnn
[6]:
cnn = cnn_model()
cnn.summary()
cnn.fit(x_train, y_train, batch_size=64, epochs=5, verbose=1)
cnn.save('mnist_cnn.h5', save_format='h5')
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
conv2d (Conv2D) (None, 28, 28, 64) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 14, 14, 64) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 14, 14, 32) 8224
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 7, 7, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 7, 7, 32) 4128
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 3, 3, 32) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 3, 3, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 288) 0
_________________________________________________________________
dense (Dense) (None, 256) 73984
_________________________________________________________________
dropout_3 (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 2570
=================================================================
Total params: 89,226
Trainable params: 89,226
Nontrainable params: 0
_________________________________________________________________
Evaluate the model on test set
[7]:
cnn = load_model('mnist_cnn.h5')
score = cnn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.9867
Define and train autoencoder¶
[8]:
def ae_model():
x_in = Input(shape=(28, 28, 1))
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x_in)
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
encoded = Conv2D(1, (3, 3), activation=None, padding='same')(x)
x = Conv2D(16, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
decoded = Conv2D(1, (3, 3), activation=None, padding='same')(x)
autoencoder = Model(x_in, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
return autoencoder
[9]:
ae = ae_model()
ae.summary()
ae.fit(x_train, x_train, batch_size=128, epochs=4, validation_data=(x_test, x_test), verbose=0)
ae.save('mnist_ae.h5', save_format='h5')
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 28, 28, 16) 160
_________________________________________________________________
conv2d_4 (Conv2D) (None, 28, 28, 16) 2320
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 16) 0
_________________________________________________________________
conv2d_5 (Conv2D) (None, 14, 14, 1) 145
_________________________________________________________________
conv2d_6 (Conv2D) (None, 14, 14, 16) 160
_________________________________________________________________
up_sampling2d (UpSampling2D) (None, 28, 28, 16) 0
_________________________________________________________________
conv2d_7 (Conv2D) (None, 28, 28, 16) 2320
_________________________________________________________________
conv2d_8 (Conv2D) (None, 28, 28, 1) 145
=================================================================
Total params: 5,250
Trainable params: 5,250
Nontrainable params: 0
_________________________________________________________________
Compare original with decoded images
[10]:
ae = load_model('mnist_ae.h5')
decoded_imgs = ae.predict(x_test)
n = 5
plt.figure(figsize=(20, 4))
for i in range(1, n+1):
# display original
ax = plt.subplot(2, n, i)
plt.imshow(x_test[i].reshape(28, 28))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
# display reconstruction
ax = plt.subplot(2, n, i + n)
plt.imshow(decoded_imgs[i].reshape(28, 28))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()
Generate contrastive explanation with pertinent negative¶
Explained instance:
[11]:
idx = 15
X = x_test[idx].reshape((1,) + x_test[idx].shape)
[12]:
plt.imshow(X.reshape(28, 28));
Model prediction:
[13]:
cnn.predict(X).argmax(), cnn.predict(X).max()
[13]:
(5, 0.9999138)
CEM parameters:
[14]:
mode = 'PN' # 'PN' (pertinent negative) or 'PP' (pertinent positive)
shape = (1,) + x_train.shape[1:] # instance shape
kappa = 0. # minimum difference needed between the prediction probability for the perturbed instance on the
# class predicted by the original instance and the max probability on the other classes
# in order for the first loss term to be minimized
beta = .1 # weight of the L1 loss term
gamma = 100 # weight of the optional autoencoder loss term
c_init = 1. # initial weight c of the loss term encouraging to predict a different class (PN) or
# the same class (PP) for the perturbed instance compared to the original instance to be explained
c_steps = 10 # nb of updates for c
max_iterations = 1000 # nb of iterations per value of c
feature_range = (x_train.min(),x_train.max()) # feature range for the perturbed instance
clip = (1000.,1000.) # gradient clipping
lr = 1e2 # initial learning rate
no_info_val = 1. # a value, float or featurewise, which can be seen as containing no info to make a prediction
# perturbations towards this value means removing features, and away means adding features
# for our MNIST images, the background (0.5) is the least informative,
# so positive/negative perturbations imply adding/removing features
Generate pertinent negative:
[15]:
# initialize CEM explainer and explain instance
cem = CEM(cnn, mode, shape, kappa=kappa, beta=beta, feature_range=feature_range,
gamma=gamma, ae_model=ae, max_iterations=max_iterations,
c_init=c_init, c_steps=c_steps, learning_rate_init=lr, clip=clip, no_info_val=no_info_val)
explanation = cem.explain(X)
Pertinent negative:
[16]:
print('Pertinent negative prediction: {}'.format(explanation[mode + '_pred']))
plt.imshow(explanation[mode].reshape(28, 28));
Pertinent negative prediction: 8
Generate pertinent positive¶
[17]:
mode = 'PP'
[18]:
# initialize CEM explainer and explain instance
cem = CEM(cnn, mode, shape, kappa=kappa, beta=beta, feature_range=feature_range,
gamma=gamma, ae_model=ae, max_iterations=max_iterations,
c_init=c_init, c_steps=c_steps, learning_rate_init=lr, clip=clip, no_info_val=no_info_val)
explanation = cem.explain(X)
Pertinent positive:
[19]:
print('Pertinent positive prediction: {}'.format(explanation[mode + '_pred']))
plt.imshow(explanation[mode].reshape(28, 28));
Pertinent positive prediction: 5
Clean up:
[ ]:
os.remove('mnist_cnn.h5')
os.remove('mnist_ae.h5')
This page was generated from examples/cem_iris.ipynb.
Contrastive Explanations Method (CEM) applied to Iris dataset¶
The Contrastive Explanation Method (CEM) can generate black box model explanations in terms of pertinent positives (PP) and pertinent negatives (PN). For PP, it finds what should be minimally and sufficiently present (e.g. important pixels in an image) to justify its classification. PN on the other hand identify what should be minimally and necessarily absent from the explained instance in order to maintain the original prediction.
The original paper where the algorithm is based on can be found on arXiv.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
from alibi.explainers import CEM
Load and prepare Iris dataset¶
[2]:
dataset = load_iris()
feature_names = dataset.feature_names
class_names = list(dataset.target_names)
Scale data
[3]:
dataset.data = (dataset.data  dataset.data.mean(axis=0)) / dataset.data.std(axis=0)
Define training and test set
[4]:
idx = 145
x_train,y_train = dataset.data[:idx,:], dataset.target[:idx]
x_test, y_test = dataset.data[idx+1:,:], dataset.target[idx+1:]
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
Define and train logistic regression model¶
[5]:
def lr_model():
x_in = Input(shape=(4,))
x_out = Dense(3, activation='softmax')(x_in)
lr = Model(inputs=x_in, outputs=x_out)
lr.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
return lr
[6]:
lr = lr_model()
lr.summary()
lr.fit(x_train, y_train, batch_size=16, epochs=500, verbose=0)
lr.save('iris_lr.h5', save_format='h5')
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 4)] 0
_________________________________________________________________
dense (Dense) (None, 3) 15
=================================================================
Total params: 15
Trainable params: 15
Nontrainable params: 0
_________________________________________________________________
Generate contrastive explanation with pertinent negative¶
Explained instance:
[7]:
idx = 0
X = x_test[idx].reshape((1,) + x_test[idx].shape)
print('Prediction on instance to be explained: {}'.format(class_names[np.argmax(lr.predict(X))]))
print('Prediction probabilities for each class on the instance: {}'.format(lr.predict(X)))
Prediction on instance to be explained: virginica
Prediction probabilities for each class on the instance: [[2.4778183e04 2.7473542e01 7.2501677e01]]
CEM parameters:
[8]:
mode = 'PN' # 'PN' (pertinent negative) or 'PP' (pertinent positive)
shape = (1,) + x_train.shape[1:] # instance shape
kappa = .2 # minimum difference needed between the prediction probability for the perturbed instance on the
# class predicted by the original instance and the max probability on the other classes
# in order for the first loss term to be minimized
beta = .1 # weight of the L1 loss term
c_init = 10. # initial weight c of the loss term encouraging to predict a different class (PN) or
# the same class (PP) for the perturbed instance compared to the original instance to be explained
c_steps = 10 # nb of updates for c
max_iterations = 1000 # nb of iterations per value of c
feature_range = (x_train.min(axis=0).reshape(shape).1, # feature range for the perturbed instance
x_train.max(axis=0).reshape(shape)+.1) # can be either a float or array of shape (1xfeatures)
clip = (1000.,1000.) # gradient clipping
lr_init = 1e2 # initial learning rate
Generate pertinent negative:
[9]:
# define model
lr = load_model('iris_lr.h5')
# initialize CEM explainer and explain instance
cem = CEM(lr, mode, shape, kappa=kappa, beta=beta, feature_range=feature_range,
max_iterations=max_iterations, c_init=c_init, c_steps=c_steps,
learning_rate_init=lr_init, clip=clip)
cem.fit(x_train, no_info_type='median') # we need to define what feature values contain the least
# info wrt predictions
# here we will naively assume that the featurewise median
# contains no info; domain knowledge helps!
explanation = cem.explain(X, verbose=False)
[10]:
print('Original instance: {}'.format(explanation['X']))
print('Predicted class: {}'.format(class_names[explanation['X_pred']]))
Original instance: [[ 0.55333328 1.28296331 0.70592084 0.92230284]]
Predicted class: virginica
[11]:
print('Pertinent negative: {}'.format(explanation[mode]))
print('Predicted class: {}'.format(class_names[explanation[mode + '_pred']]))
Pertinent negative: [[ 0.5533333 1.2829633 0.20453337 0.92230284]]
Predicted class: versicolor
Store explanation to plot later on:
[12]:
expl = {}
expl['PN'] = explanation[mode]
expl['PN_pred'] = explanation[mode + '_pred']
Generate pertinent positive¶
[13]:
mode = 'PP'
Generate pertinent positive:
[14]:
# define model
lr = load_model('iris_lr.h5')
# initialize CEM explainer and explain instance
cem = CEM(lr, mode, shape, kappa=kappa, beta=beta, feature_range=feature_range,
max_iterations=max_iterations, c_init=c_init, c_steps=c_steps,
learning_rate_init=lr_init, clip=clip)
cem.fit(x_train, no_info_type='median')
explanation = cem.explain(X, verbose=False)
[15]:
print('Pertinent positive: {}'.format(explanation[mode]))
print('Predicted class: {}'.format(class_names[explanation[mode + '_pred']]))
Pertinent positive: [[7.44469730e09 3.47054341e08 3.70659438e01 9.17898107e01]]
Predicted class: virginica
[16]:
expl['PP'] = explanation[mode]
expl['PP_pred'] = explanation[mode + '_pred']
Visualize PN and PP¶
Let’s visualize the generated explanations to check if the perturbed instances make sense.
Create dataframe from standardized data:
[17]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['species'] = np.array([dataset.target_names[i] for i in dataset.target])
Highlight explained instance and add pertinent negative and positive to the dataset:
[18]:
pn = pd.DataFrame(expl['PN'], columns=dataset.feature_names)
pn['species'] = 'PN_' + class_names[expl['PN_pred']]
pp = pd.DataFrame(expl['PP'], columns=dataset.feature_names)
pp['species'] = 'PP_' + class_names[expl['PP_pred']]
orig_inst = pd.DataFrame(explanation['X'], columns=dataset.feature_names)
orig_inst['species'] = 'orig_' + class_names[explanation['X_pred']]
df = df.append([pn, pp, orig_inst], ignore_index=True)
Pair plots between the features show that the pertinent negative is pushed from the original instance (versicolor) into the virginica distribution while the pertinent positive moved away from the virginica distribution.
[19]:
fig = sns.pairplot(df, hue='species', diag_kind='hist');
Use numerical gradients in CEM¶
If we do not have access to the Keras or TensorFlow model weights, we can use numerical gradients for the first term in the loss function that needs to be minimized (eq. 1 and 4 in the paper).
CEM parameters:
[20]:
mode = 'PN'
If numerical gradients are used to compute:
with L = loss function; p = predict function and x the parameter to optimize, then the tuple eps can be used to define the perturbation used to compute the derivatives. eps[0] is used to calculate the first partial derivative term and eps[1] is used for the second term. eps[0] and eps[1] can be a combination of float values or numpy arrays. For eps[0], the array dimension should be (1 x nb of prediction categories) and for eps[1] it should be (1 x nb of features).
[21]:
eps0 = np.array([[1e2, 1e2, 1e2]]) # 3 prediction categories, equivalent to 1e2
eps1 = np.array([[1e2, 1e2, 1e2, 1e2]]) # 4 features, also equivalent to 1e2
eps = (eps0, eps1)
For complex models with a high number of parameters and a high dimensional feature space (e.g. Inception on ImageNet), evaluating numerical gradients can be expensive as they involve multiple prediction calls for each perturbed instance. The update_num_grad parameter allows you to set a batch size on which to evaluate the numerical gradients, drastically reducing the number of prediction calls required.
[22]:
update_num_grad = 1
Generate pertinent negative:
[23]:
# define model
lr = load_model('iris_lr.h5')
predict_fn = lambda x: lr.predict(x) # only pass the predict fn which takes numpy arrays to CEM
# explainer can no longer minimize wrt model weights
# initialize CEM explainer and explain instance
cem = CEM(predict_fn, mode, shape, kappa=kappa, beta=beta,
feature_range=feature_range, max_iterations=max_iterations,
eps=eps, c_init=c_init, c_steps=c_steps, learning_rate_init=lr_init,
clip=clip, update_num_grad=update_num_grad)
cem.fit(x_train, no_info_type='median')
explanation = cem.explain(X, verbose=False)
[24]:
print('Original instance: {}'.format(explanation['X']))
print('Predicted class: {}'.format(class_names[explanation['X_pred']]))
Original instance: [[ 0.55333328 1.28296331 0.70592084 0.92230284]]
Predicted class: virginica
[25]:
print('Pertinent negative: {}'.format(explanation[mode]))
print('Predicted class: {}'.format(class_names[explanation[mode + '_pred']]))
Pertinent negative: [[ 0.5533333 1.2829633 0.20392278 0.92230284]]
Predicted class: versicolor
Clean up:
[26]:
os.remove('iris_lr.h5')
This page was generated from examples/cf_mnist.ipynb.
Counterfactual instances on MNIST¶
Given a test instance \(X\), this method can generate counterfactual instances \(X^\prime\) given a desired counterfactual class \(t\) which can either be a class specified upfront or any other class that is different from the predicted class of \(X\).
The loss function for finding counterfactuals is the following:
The first loss term, guides the search towards instances \(X^\prime\) for which the predicted class probability \(f_t(X^\prime)\) is close to a prespecified target class probability \(p_t\) (typically \(p_t=1\)). The second loss term ensures that the counterfactuals are close in the feature space to the original test instance.
In this notebook we illustrate the usage of the basic counterfactual algorithm on the MNIST dataset.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Input
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
from time import time
from alibi.explainers import CounterFactual
Load and prepare MNIST data¶
[2]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
plt.gray()
plt.imshow(x_test[1]);
x_train shape: (60000, 28, 28) y_train shape: (60000,)
Prepare data: scale, reshape and categorize
[3]:
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = np.reshape(x_train, x_train.shape + (1,))
x_test = np.reshape(x_test, x_test.shape + (1,))
print('x_train shape:', x_train.shape, 'x_test shape:', x_test.shape)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print('y_train shape:', y_train.shape, 'y_test shape:', y_test.shape)
x_train shape: (60000, 28, 28, 1) x_test shape: (10000, 28, 28, 1)
y_train shape: (60000, 10) y_test shape: (10000, 10)
[4]:
xmin, xmax = .5, .5
x_train = ((x_train  x_train.min()) / (x_train.max()  x_train.min())) * (xmax  xmin) + xmin
x_test = ((x_test  x_test.min()) / (x_test.max()  x_test.min())) * (xmax  xmin) + xmin
Define and train CNN model¶
[5]:
def cnn_model():
x_in = Input(shape=(28, 28, 1))
x = Conv2D(filters=64, kernel_size=2, padding='same', activation='relu')(x_in)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Conv2D(filters=32, kernel_size=2, padding='same', activation='relu')(x)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x_out = Dense(10, activation='softmax')(x)
cnn = Model(inputs=x_in, outputs=x_out)
cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return cnn
[6]:
cnn = cnn_model()
cnn.summary()
cnn.fit(x_train, y_train, batch_size=64, epochs=3, verbose=0)
cnn.save('mnist_cnn.h5')
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
conv2d (Conv2D) (None, 28, 28, 64) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 14, 14, 64) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 14, 14, 32) 8224
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 7, 7, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 1568) 0
_________________________________________________________________
dense (Dense) (None, 256) 401664
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 2570
=================================================================
Total params: 412,778
Trainable params: 412,778
Nontrainable params: 0
_________________________________________________________________
Evaluate the model on test set
[7]:
cnn = load_model('mnist_cnn.h5')
score = cnn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.9887
Generate counterfactuals¶
Original instance:
[8]:
X = x_test[0].reshape((1,) + x_test[0].shape)
plt.imshow(X.reshape(28, 28));
Counterfactual parameters:
[9]:
shape = (1,) + x_train.shape[1:]
target_proba = 1.0
tol = 0.01 # want counterfactuals with p(class)>0.99
target_class = 'other' # any class other than 7 will do
max_iter = 1000
lam_init = 1e1
max_lam_steps = 10
learning_rate_init = 0.1
feature_range = (x_train.min(),x_train.max())
Run counterfactual:
[10]:
# initialize explainer
cf = CounterFactual(cnn, shape=shape, target_proba=target_proba, tol=tol,
target_class=target_class, max_iter=max_iter, lam_init=lam_init,
max_lam_steps=max_lam_steps, learning_rate_init=learning_rate_init,
feature_range=feature_range)
start_time = time()
explanation = cf.explain(X)
print('Explanation took {:.3f} sec'.format(time()  start_time))
Explanation took 8.739 sec
Results:
[11]:
pred_class = explanation['cf']['class']
proba = explanation['cf']['proba'][0][pred_class]
print(f'Counterfactual prediction: {pred_class} with probability {proba}')
plt.imshow(explanation['cf']['X'].reshape(28, 28));
Counterfactual prediction: 9 with probability 0.9919260144233704
The counterfactual starting from a 7 moves towards the closest class as determined by the model and the data  in this case a 9. The evolution of the counterfactual during the iterations over \(\lambda\) can be seen below (note that all of the following examples satisfy the counterfactual condition):
[12]:
n_cfs = np.array([len(explanation['all'][iter_cf]) for iter_cf in range(max_lam_steps)])
examples = {}
for ix, n in enumerate(n_cfs):
if n>0:
examples[ix] = {'ix': ix, 'lambda': explanation['all'][ix][0]['lambda'],
'X': explanation['all'][ix][0]['X']}
columns = len(examples) + 1
rows = 1
fig = plt.figure(figsize=(16,6))
for i, key in enumerate(examples.keys()):
ax = plt.subplot(rows, columns, i+1)
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.imshow(examples[key]['X'].reshape(28,28))
plt.title(f'Iteration: {key}')
Typically, the first few iterations find counterfactuals that are out of distribution, while the later iterations make the counterfactual more sparse and interpretable.
Let’s now try to steer the counterfactual to a specific class:
[13]:
target_class = 1
cf = CounterFactual(cnn, shape=shape, target_proba=target_proba, tol=tol,
target_class=target_class, max_iter=max_iter, lam_init=lam_init,
max_lam_steps=max_lam_steps, learning_rate_init=learning_rate_init,
feature_range=feature_range)
explanation = start_time = time()
explanation = cf.explain(X)
print('Explanation took {:.3f} sec'.format(time()  start_time))
Explanation took 6.558 sec
Results:
[14]:
pred_class = explanation['cf']['class']
proba = explanation['cf']['proba'][0][pred_class]
print(f'Counterfactual prediction: {pred_class} with probability {proba}')
plt.imshow(explanation['cf']['X'].reshape(28, 28));
Counterfactual prediction: 1 with probability 0.998101532459259
As you can see, by specifying a class, the search process can’t go towards the closest class to the test instance (in this case a 9 as we saw previously), so the resulting counterfactual might be less interpretable. We can gain more insight by looking at the difference between the counterfactual and the original instance:
[15]:
plt.imshow((explanation['cf']['X']  X).reshape(28, 28));
This shows that the counterfactual is stripping out the top part of the 7 to make to result in a prediction of 1  not very surprising as the dataset has a lot of examples of diagonally slanted 1’s.
Clean up:
[16]:
os.remove('mnist_cnn.h5')
This page was generated from examples/cfproto_mnist.ipynb.
Counterfactuals guided by prototypes on MNIST¶
This method is described in the Interpretable Counterfactual Explanations Guided by Prototypes paper and can generate counterfactual instances guided by class prototypes. It means that for a certain instance X, the method builds a prototype for each prediction class using either an autoencoder or kd trees. The nearest prototype class other than the originally predicted class is then used to guide the counterfactual search. For example, in MNIST the closest class to a 7 could be a 9. As a result, the prototype loss term will try to minimize the distance between the proposed counterfactual and the prototype of a 9. This speeds up the search towards a satisfactory counterfactual by steering it towards an interpretable solution from the start of the optimization. It also helps to avoid outofdistribution counterfactuals with the perturbations driven to a prototype of another class.
The loss function to be optimized is the following:
\(Loss\) = c\(L_{pred}\) + \(\beta\)\(L_{1}\) + \(L_{2}\) + \(L_{AE}\) + \(L_{proto}\)
The first loss term relates to the model’s prediction function, the following 2 terms define the elastic net regularization while the last 2 terms are optional. The aim of \(L_{AE}\) is to penalize outofdistribution counterfactuals while \(L_{proto}\) guides the counterfactual to a prototype. When we only have acces to the model’s prediction function and cannot fully enjoy the benefits of automatic differentiation, the prototypes allow us to drop the prediction function loss term \(L_{pred}\) and still generate high quality counterfactuals. This drastically reduces the number of prediction calls made during the numerical gradient update step and again speeds up the search.
Other options include generating counterfactuals for specific classes or including trust score constraints to ensure that the counterfactual is close enough to the newly predicted class compared to the original class. Different use cases are illustrated throughout this notebook.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Input, UpSampling2D
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
from time import time
from alibi.explainers import CounterFactualProto
Load and prepare MNIST data¶
[2]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
plt.gray()
plt.imshow(x_test[1]);
x_train shape: (60000, 28, 28) y_train shape: (60000,)
Prepare data: scale, reshape and categorize
[3]:
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = np.reshape(x_train, x_train.shape + (1,))
x_test = np.reshape(x_test, x_test.shape + (1,))
print('x_train shape:', x_train.shape, 'x_test shape:', x_test.shape)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print('y_train shape:', y_train.shape, 'y_test shape:', y_test.shape)
x_train shape: (60000, 28, 28, 1) x_test shape: (10000, 28, 28, 1)
y_train shape: (60000, 10) y_test shape: (10000, 10)
[4]:
xmin, xmax = .5, .5
x_train = ((x_train  x_train.min()) / (x_train.max()  x_train.min())) * (xmax  xmin) + xmin
x_test = ((x_test  x_test.min()) / (x_test.max()  x_test.min())) * (xmax  xmin) + xmin
Define and train CNN model¶
[5]:
def cnn_model():
x_in = Input(shape=(28, 28, 1))
x = Conv2D(filters=32, kernel_size=2, padding='same', activation='relu')(x_in)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Conv2D(filters=64, kernel_size=2, padding='same', activation='relu')(x)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x_out = Dense(10, activation='softmax')(x)
cnn = Model(inputs=x_in, outputs=x_out)
cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return cnn
[6]:
cnn = cnn_model()
cnn.fit(x_train, y_train, batch_size=32, epochs=3, verbose=0)
cnn.save('mnist_cnn.h5', save_format='h5')
Evaluate the model on test set
[7]:
cnn = load_model('mnist_cnn.h5')
score = cnn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.9887
Define and train autoencoder¶
[8]:
def ae_model():
# encoder
x_in = Input(shape=(28, 28, 1))
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x_in)
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
encoded = Conv2D(1, (3, 3), activation=None, padding='same')(x)
encoder = Model(x_in, encoded)
# decoder
dec_in = Input(shape=(14, 14, 1))
x = Conv2D(16, (3, 3), activation='relu', padding='same')(dec_in)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
decoded = Conv2D(1, (3, 3), activation=None, padding='same')(x)
decoder = Model(dec_in, decoded)
# autoencoder = encoder + decoder
x_out = decoder(encoder(x_in))
autoencoder = Model(x_in, x_out)
autoencoder.compile(optimizer='adam', loss='mse')
return autoencoder, encoder, decoder
[9]:
ae, enc, dec = ae_model()
ae.fit(x_train, x_train, batch_size=128, epochs=4, validation_data=(x_test, x_test), verbose=0)
ae.save('mnist_ae.h5', save_format='h5')
enc.save('mnist_enc.h5', save_format='h5')
Compare original with decoded images
[10]:
ae = load_model('mnist_ae.h5')
enc = load_model('mnist_enc.h5', compile=False)
decoded_imgs = ae.predict(x_test)
n = 5
plt.figure(figsize=(20, 4))
for i in range(1, n+1):
# display original
ax = plt.subplot(2, n, i)
plt.imshow(x_test[i].reshape(28, 28))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
# display reconstruction
ax = plt.subplot(2, n, i + n)
plt.imshow(decoded_imgs[i].reshape(28, 28))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()
Generate counterfactual guided by the nearest class prototype¶
Original instance:
[11]:
X = x_test[0].reshape((1,) + x_test[0].shape)
plt.imshow(X.reshape(28, 28));
Counterfactual parameters:
[12]:
shape = (1,) + x_train.shape[1:]
gamma = 100.
theta = 100.
c_init = 1.
c_steps = 2
max_iterations = 1000
feature_range = (x_train.min(),x_train.max())
Run counterfactual:
[25]:
# initialize explainer, fit and generate counterfactual
cf = CounterFactualProto(cnn, shape, gamma=gamma, theta=theta,
ae_model=ae, enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
start_time = time()
cf.fit(x_train) # find class prototypes
print('Time to find prototypes each class: {:.3f} sec'.format(time()  start_time))
start_time = time()
explanation = cf.explain(X)
print('Explanation took {:.3f} sec'.format(time()  start_time))
Time to find prototypes each class: 10.619 sec
Explanation took 8.659 sec
Results:
[26]:
print('Counterfactual prediction: {}'.format(explanation['cf']['class']))
print('Closest prototype class: {}'.format(cf.id_proto))
plt.imshow(explanation['cf']['X'].reshape(28, 28));
Counterfactual prediction: 9
Closest prototype class: 9
The counterfactual starting from a 7 moves towards its closest prototype class: a 9. The evolution of the counterfactual during the first iteration can be seen below:
[28]:
iter_cf = 0
print('iteration c {}'.format(iter_cf))
n = len(explanation['all'][iter_cf])
plt.figure(figsize=(20, 4))
for i in range(n):
ax = plt.subplot(1, n+1, i+1)
plt.imshow(explanation['all'][iter_cf][i].reshape(28, 28))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()
iteration c 0
Typically, the first few iterations already steer the 7 towards a 9, while the later iterations make the counterfactual more sparse.
Prototypes defined by the \(k\) nearest encoded instances¶
In the above example, the class prototypes are defined by the average encoding of all instances belonging to the specific class. Instead, we can also select only the \(k\) nearest encoded instances of a class to the encoded instance to be explained and use the average over those \(k\) encodings as the prototype.
[29]:
# initialize explainer, fit and generate counterfactuals
cf = CounterFactualProto(cnn, shape, gamma=gamma, theta=theta,
ae_model=ae, enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
cf.fit(x_train)
explanation_k1 = cf.explain(X, k=1, k_type='mean')
explanation_k20 = cf.explain(X, k=20, k_type='mean')
Results for \(k\) equals 1:
[30]:
print('Counterfactual prediction: {}'.format(explanation_k1['cf']['class']))
print('Closest prototype class: {}'.format(cf.id_proto))
plt.imshow(explanation_k1['cf']['X'].reshape(28, 28));
Counterfactual prediction: 9
Closest prototype class: 9
Results for \(k\) equals 20:
[31]:
print('Counterfactual prediction: {}'.format(explanation_k20['cf']['class']))
print('Closest prototype class: {}'.format(cf.id_proto))
plt.imshow(explanation_k20['cf']['X'].reshape(28, 28));
Counterfactual prediction: 9
Closest prototype class: 9
A lower value of \(k\) typically leads to counterfactuals that look more like the original instance and less like an average instance of the counterfactual class.
Remove the autoencoder loss term \(L_{AE}\)¶
In the previous example, we used both an autoencoder loss term to penalize a counterfactual which falls outside of the training data distribution as well as an encoder loss term to guide the counterfactual to the nearest prototype class. In the next example we get rid of the autoencoder loss term to speed up the counterfactual search and still generate decent counterfactuals:
[32]:
# initialize explainer, fit and generate counterfactuals
cf = CounterFactualProto(cnn, shape, gamma=gamma, theta=theta,
enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
cf.fit(x_train)
start_time = time()
explanation = cf.explain(X, k=1)
print('Explanation took {:.3f} sec'.format(time()  start_time))
Explanation took 6.284 sec
Results:
[33]:
print('Counterfactual prediction: {}'.format(explanation['cf']['class']))
print('Closest prototype class: {}'.format(cf.id_proto))
plt.imshow(explanation['cf']['X'].reshape(28, 28));
Counterfactual prediction: 9
Closest prototype class: 9
Specify prototype classes¶
For multiclass predictions, we might be interested to generate counterfactuals for certain classes while avoiding others. The following example illustrates how to do this:
[34]:
X = x_test[12].reshape((1,) + x_test[1].shape)
plt.imshow(X.reshape(28, 28));
[35]:
# initialize explainer, fit and generate counterfactuals
cf = CounterFactualProto(cnn, shape, gamma=gamma, theta=theta,
ae_model=ae, enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
cf.fit(x_train)
explanation_1 = cf.explain(X, k=5, k_type='mean')
proto_1 = cf.id_proto
explanation_2 = cf.explain(X, k=5, k_type='mean', target_class=[7])
proto_2 = cf.id_proto
The closest class to the 9 is 4. This is evident by looking at the first counterfactual below. For the second counterfactual, we specified that the prototype class used in the search should be a 7. As a result, a counterfactual 7 instead of a 4 is generated.
[36]:
print('Counterfactual prediction: {}'.format(explanation_1['cf']['class']))
print('Closest prototype class: {}'.format(proto_1))
plt.imshow(explanation_1['cf']['X'].reshape(28, 28));
Counterfactual prediction: 4
Closest prototype class: 4
[37]:
print('Counterfactual prediction: {}'.format(explanation_2['cf']['class']))
print('Closest prototype class: {}'.format(proto_2))
plt.imshow(explanation_2['cf']['X'].reshape(28, 28));
Counterfactual prediction: 7
Closest prototype class: 7
Speed up the counterfactual search by removing the predict function loss term¶
We can also remove the prediction loss term and still obtain an interpretable counterfactual. This is especially relevant for fully black box models. When we provide the counterfactual search method with a Keras or TensorFlow model, it is incorporated in the TensorFlow graph and evaluated using automatic differentiation. However, if we only have access to the model’s prediction function, the gradient updates are numerical and typically require a large number of prediction calls because of the prediction loss term \(L_{pred}\). These prediction calls can slow the search down significantly and become a bottleneck. We can represent the gradient of the loss term as follows:
where \(L_{pred}\) is the prediction loss term, \(p\) the prediction function and \(x\) the input features to optimize. For a 28 by 28 MNIST image, the \(^{\delta p}/_{\delta x}\) term alone would require a prediction call with batch size 28x28x2 = 1568. By using the prototypes to guide the search however, we can remove the prediction loss term and only make a single prediction at the end of each gradient update to check whether the predicted class on the proposed counterfactual is different from the original class. We do not necessarily need a Keras or TensorFlow autoencoder either and can use kd trees to find the nearest class prototypes. Please check out this notebook for a practical example.
The first example below removes \(L_{pred}\) from the loss function to bypass the bottleneck. It illustrates the drastic speed improvements over the black box alternative with numerical gradient evaluation while still producing interpretable counterfactual instances.
[38]:
plt.gray()
X = x_test[23].reshape(1, 28, 28, 1)
plt.imshow(X.reshape(28, 28));
[39]:
c_init = 0. # weight on prediction loss term set to 0
c_steps = 1 # no need to find optimal values for c
[40]:
# define a blackbox model
predict_fn = lambda x: cnn.predict(x)
# initialize explainer, fit and generate counterfactuals
cf = CounterFactualProto(predict_fn, shape, gamma=gamma, theta=theta,
ae_model=ae, enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
cf.fit(x_train)
start_time = time()
explanation = cf.explain(X, k=1)
print('Explanation took {:.3f} sec'.format(time()  start_time))
Explanation took 6.271 sec
[41]:
print('Counterfactual prediction: {}'.format(explanation['cf']['class']))
print('Closest prototype class: {}'.format(cf.id_proto))
plt.imshow(explanation['cf']['X'].reshape(28, 28));
Counterfactual prediction: 6
Closest prototype class: 6
Let us know add the \(L_{pred}\) loss term back in the objective function and observe how long it takes to generate a black box counterfactual:
[42]:
c_init = 1.
c_steps = 2
[43]:
# define a blackbox model
predict_fn = lambda x: cnn.predict(x)
# initialize explainer, fit and generate counterfactuals
cf = CounterFactualProto(predict_fn, shape, gamma=gamma, theta=theta,
ae_model=ae, enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
cf.fit(x_train)
start_time = time()
explanation = cf.explain(X, k=1)
print('Explanation took {:.3f} sec'.format(time()  start_time))
Explanation took 545.836 sec
[44]:
print('Counterfactual prediction: {}'.format(explanation['cf']['class']))
print('Closest prototype class: {}'.format(cf.id_proto))
plt.imshow(explanation['cf']['X'].reshape(28, 28));
Counterfactual prediction: 6
Closest prototype class: 6
Clean up:
[32]:
os.remove('mnist_cnn.h5')
os.remove('mnist_ae.h5')
os.remove('mnist_enc.h5')
This page was generated from examples/cfproto_housing.ipynb.
Counterfactuals guided by prototypes on Boston housing dataset¶
This notebook goes through an example of prototypical counterfactuals using kd trees to build the prototypes. Please check out this notebook for a more indepth application of the method on MNIST using (auto)encoders and trust scores.
In this example, we will train a simple neural net to predict whether house prices in the Boston area are above the median value or not. We can then find a counterfactual to see which variables need to be changed to increase or decrease a house price above or below the median value.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn.datasets import load_boston
from alibi.explainers import CounterFactualProto
Load and prepare Boston housing dataset¶
[2]:
boston = load_boston()
data = boston.data
target = boston.target
feature_names = boston.feature_names
Transform into classification task: target becomes whether house price is above the overall median or not
[3]:
y = np.zeros((target.shape[0],))
y[np.where(target > np.median(target))[0]] = 1
Remove categorical feature
[4]:
data = np.delete(data, 3, 1)
feature_names = np.delete(feature_names, 3)
Explanation of remaining features:
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of nonretail business acres per town
RM: average number of rooms per dwelling
AGE: proportion of owneroccupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: fullvalue propertytax rate per USD10,000
PTRATIO: pupilteacher ratio by town
B: 1000(Bk  0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
Standardize data
[5]:
mu = data.mean(axis=0)
sigma = data.std(axis=0)
data = (data  mu) / sigma
Define train and test set
[6]:
idx = 475
x_train,y_train = data[:idx,:], y[:idx]
x_test, y_test = data[idx:,:], y[idx:]
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
Train model¶
[7]:
np.random.seed(0)
tf.set_random_seed(0)
[8]:
def nn_model():
x_in = Input(shape=(12,))
x = Dense(40, activation='relu')(x_in)
x = Dense(40, activation='relu')(x)
x_out = Dense(2, activation='softmax')(x)
nn = Model(inputs=x_in, outputs=x_out)
nn.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
return nn
[9]:
nn = nn_model()
nn.summary()
nn.fit(x_train, y_train, batch_size=64, epochs=500, verbose=0)
nn.save('nn_boston.h5', save_format='h5')
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 12)] 0
_________________________________________________________________
dense (Dense) (None, 40) 520
_________________________________________________________________
dense_1 (Dense) (None, 40) 1640
_________________________________________________________________
dense_2 (Dense) (None, 2) 82
=================================================================
Total params: 2,242
Trainable params: 2,242
Nontrainable params: 0
_________________________________________________________________
[10]:
score = nn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.83870965
Generate counterfactual guided by the nearest class prototype¶
Original instance:
[11]:
X = x_test[1].reshape((1,) + x_test[1].shape)
shape = X.shape
Run counterfactual:
[12]:
# define model
nn = load_model('nn_boston.h5')
# initialize explainer, fit and generate counterfactual
cf = CounterFactualProto(nn, shape, use_kdtree=True, theta=10., max_iterations=1000,
feature_range=(x_train.min(axis=0), x_train.max(axis=0)),
c_init=1., c_steps=10)
cf.fit(x_train)
explanation = cf.explain(X)
WARNING: Logging before flag parsing goes to stderr.
W0815 14:57:04.305606 140366215788352 cfproto.py:336] No encoder specified. Using kd trees to represent class prototypes.
The prediction flipped from 0 (value below the median) to 1 (above the median):
[13]:
print('Original prediction: {}'.format(explanation['orig_class']))
print('Counterfactual prediction: {}'.format(explanation['cf']['class']))
Original prediction: 0
Counterfactual prediction: 1
Let’s take a look at the counterfactual. To make the results more interpretable, we will first undo the preprocessing step and then check where the counterfactual differs from the original instance:
[14]:
orig = X * sigma + mu
counterfactual = explanation['cf']['X'] * sigma + mu
delta = counterfactual  orig
for i, f in enumerate(feature_names):
if np.abs(delta[0][i]) > 1e4:
print('{}: {}'.format(f, delta[0][i]))
AGE: 11.460830148972562
LSTAT: 5.1282056172858645
So in order to increase the house price, the proportion of owneroccupied units built prior to 1940 should decrease by ~1112%. This is not surprising since the proportion for the observation is very high at 93.6%. Furthermore, the % of the population with “lower status” should decrease by ~5%.
[15]:
print('% owneroccupied units built prior to 1940: {}'.format(orig[0][5]))
print('% lower status of the population: {}'.format(orig[0][11]))
% owneroccupied units built prior to 1940: 93.6
% lower status of the population: 18.68
Clean up:
[16]:
os.remove('nn_boston.h5')
This page was generated from examples/cfproto_cat_adult_ohe.ipynb.
Counterfactual explanations with onehot encoded categorical variables¶
Real world machine learning applications often handle data with categorical variables. Explanation methods which rely on perturbations of the input features need to make sure those perturbations are meaningful and capture the underlying structure of the data. This becomes tricky for categorical features. For instance random perturbations across possible categories or enforcing a ranking between categories based on frequency of occurrence in the training data do not capture this structure. Our method captures the relation between categories of a variable numerically through the context given by the other features in the data and/or the predictions made by the model. First it captures the pairwise distances between categories and then applies multidimensional scaling. More details about the method can be found in the documentation. The example notebook illustrates this approach on the adult dataset, which contains a mixture of categorical and numerical features used to predict whether a person’s income is above or below $50k.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn.preprocessing import OneHotEncoder
from time import time
from alibi.datasets import fetch_adult
from alibi.explainers import CounterFactualProto
from alibi.utils.mapping import ohe_to_ord, ord_to_ohe
Load adult dataset¶
The fetch_adult
function returns a Bunch
object containing the features, the targets, the feature names and a mapping of the categories in each categorical variable.
[2]:
adult = fetch_adult()
data = adult.data
target = adult.target
feature_names = adult.feature_names
category_map_tmp = adult.category_map
target_names = adult.target_names
Define shuffled training and test set:
[3]:
def set_seed(s=0):
np.random.seed(s)
tf.set_random_seed(s)
[4]:
set_seed()
data_perm = np.random.permutation(np.c_[data, target])
X = data_perm[:,:1]
y = data_perm[:,1]
[5]:
idx = 30000
y_train, y_test = y[:idx], y[idx+1:]
Reorganize data so categorical features come first:
[6]:
X = np.c_[X[:, 1:8], X[:, 11], X[:, 0], X[:, 8:11]]
Adjust feature_names
and category_map
as well:
[7]:
feature_names = feature_names[1:8] + feature_names[11:12] + feature_names[0:1] + feature_names[8:11]
print(feature_names)
['Workclass', 'Education', 'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Country', 'Age', 'Capital Gain', 'Capital Loss', 'Hours per week']
[8]:
category_map = {}
for i, (_, v) in enumerate(category_map_tmp.items()):
category_map[i] = v
Create a dictionary with as keys the categorical columns and values the number of categories for each variable in the dataset:
[9]:
cat_vars_ord = {}
n_categories = len(list(category_map.keys()))
for i in range(n_categories):
cat_vars_ord[i] = len(np.unique(X[:, i]))
print(cat_vars_ord)
{0: 9, 1: 7, 2: 4, 3: 9, 4: 6, 5: 5, 6: 2, 7: 11}
Since we will apply onehot encoding (OHE) on the categorical variables, we convert cat_vars_ord
from the ordinal to OHE format. alibi.utils.mapping
contains utility functions to do this. The keys in cat_vars_ohe
now represent the first column index for each onehot encoded categorical variable. This dictionary will later be used in the counterfactual explanation.
[10]:
cat_vars_ohe = ord_to_ohe(X, cat_vars_ord)[1]
print(cat_vars_ohe)
{0: 9, 9: 7, 16: 4, 20: 9, 29: 6, 35: 5, 40: 2, 42: 11}
Preprocess data¶
Scale numerical features between 1 and 1:
[11]:
X_num = X[:, 4:].astype(np.float32, copy=False)
xmin, xmax = X_num.min(axis=0), X_num.max(axis=0)
rng = (1., 1.)
X_num_scaled = (X_num  xmin) / (xmax  xmin) * (rng[1]  rng[0]) + rng[0]
X_num_scaled_train = X_num_scaled[:idx, :]
X_num_scaled_test = X_num_scaled[idx+1:, :]
Apply OHE to categorical variables:
[12]:
X_cat = X[:, :4].copy()
ohe = OneHotEncoder(categories='auto')
ohe.fit(X_cat)
X_cat_ohe = ohe.transform(X_cat)
Combine numerical and categorical data:
[13]:
X = np.c_[X_cat_ohe.todense(), X_num_scaled].astype(np.float32, copy=False)
X_train, X_test = X[:idx, :], X[idx+1:, :]
print(X_train.shape, X_test.shape)
(30000, 57) (2560, 57)
Train neural net¶
[14]:
def nn_ohe():
x_in = Input(shape=(57,))
x = Dense(60, activation='relu')(x_in)
x = Dropout(.2)(x)
x = Dense(60, activation='relu')(x)
x = Dropout(.2)(x)
x = Dense(60, activation='relu')(x)
x = Dropout(.2)(x)
x_out = Dense(2, activation='softmax')(x)
nn = Model(inputs=x_in, outputs=x_out)
nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return nn
[15]:
set_seed()
nn = nn_ohe()
nn.summary()
nn.fit(X_train, to_categorical(y_train), batch_size=256, epochs=20, verbose=0)
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 57)] 0
_________________________________________________________________
dense (Dense) (None, 60) 3480
_________________________________________________________________
dropout (Dropout) (None, 60) 0
_________________________________________________________________
dense_1 (Dense) (None, 60) 3660
_________________________________________________________________
dropout_1 (Dropout) (None, 60) 0
_________________________________________________________________
dense_2 (Dense) (None, 60) 3660
_________________________________________________________________
dropout_2 (Dropout) (None, 60) 0
_________________________________________________________________
dense_3 (Dense) (None, 2) 122
=================================================================
Total params: 10,922
Trainable params: 10,922
Nontrainable params: 0
_________________________________________________________________
[15]:
<tensorflow.python.keras.callbacks.History at 0x7fa11671d080>
Generate counterfactual¶
Original instance:
[16]:
X = X_test[0].reshape((1,) + X_test[0].shape)
Initialize counterfactual parameters. The feature perturbations are applied in the numerical feature space, after transforming the categorical variables to numerical features. As a result, the dimensionality and values of feature_range
are defined in the numerical space.
[17]:
shape = X.shape
beta = .01
c_init = 1.
c_steps = 5
max_iterations = 500
rng = (1., 1.) # scale features between 1 and 1
rng_shape = (1,) + data.shape[1:]
feature_range = ((np.ones(rng_shape) * rng[0]).astype(np.float32),
(np.ones(rng_shape) * rng[1]).astype(np.float32))
Initialize explainer:
[18]:
def set_seed(s=0):
np.random.seed(s)
tf.set_random_seed(s)
[19]:
set_seed()
cf = CounterFactualProto(nn,
shape,
beta=beta,
cat_vars=cat_vars_ohe,
ohe=True, # OHE flag
max_iterations=max_iterations,
feature_range=feature_range,
c_init=c_init,
c_steps=c_steps
)
Fit explainer. d_type
refers to the distance metric used to convert the categorical to numerical values. Valid options are abdm
, mvdm
and abdmmvdm
. abdm
infers the distance between categories of the same variable from the context provided by the other variables. This requires binning of the numerical features as well. mvdm
computes the distance using the model predictions, and abdmmvdm
combines both methods. More info on both distance measures can be found in the
documentation.
[20]:
cf.fit(X_train, d_type='abdm', disc_perc=[25, 50, 75])
We can now visualize the transformation from the categorical to numerical values for each category. The example below shows that the Education feature is ordered from High School Dropout to having obtained a Doctorate degree. As a result, if we perturb an instance representing a person that has obtained a Bachelors degree, the nearest perturbations will result in a counterfactual instance with either a Masters or an Associates degree.
[21]:
def plot_bar(dist, cols, figsize=(10,4)):
dist = dist.reshape(dist.shape[0])
idx = np.argsort(dist)
fig, ax = plt.subplots(figsize=figsize)
plt.bar(cols[idx], dist[idx])
print(cols[idx])
[22]:
cat = 'Education'
idx = feature_names.index(cat)
plot_bar(cf.d_abs[idx], np.array(category_map[idx]), figsize=(20,4))
['Dropout' 'High School grad' 'Associates' 'Bachelors' 'Masters'
'ProfSchool' 'Doctorate']
Explain instance:
[23]:
explanation = cf.explain(X)
Helper function to more clearly describe explanations:
[24]:
def describe_instance(X, explanation, eps=1e2):
print('Original instance: {}  proba: {}'.format(target_names[explanation['orig_class']],
explanation['orig_proba'][0]))
print('Counterfactual instance: {}  proba: {}'.format(target_names[explanation['cf']['class']],
explanation['cf']['proba'][0]))
print('\nCounterfactual perturbations...')
print('\nCategorical:')
X_orig_ord = ohe_to_ord(X, cat_vars_ohe)[0]
X_cf_ord = ohe_to_ord(explanation['cf']['X'], cat_vars_ohe)[0]
delta_cat = {}
for i, (_, v) in enumerate(category_map.items()):
cat_orig = v[int(X_orig_ord[0, i])]
cat_cf = v[int(X_cf_ord[0, i])]
if cat_orig != cat_cf:
delta_cat[feature_names[i]] = [cat_orig, cat_cf]
if delta_cat:
for k, v in delta_cat.items():
print('{}: {} > {}'.format(k, v[0], v[1]))
print('\nNumerical:')
delta_num = X_cf_ord[0, 4:]  X_orig_ord[0, 4:]
n_keys = len(list(cat_vars_ord.keys()))
for i in range(delta_num.shape[1]):
if np.abs(delta_num[0, i]) > eps:
print('{}: {:.2f} > {:.2f}'.format(feature_names[i+n_keys],
X_orig_ord[0,i+n_keys],
X_cf_ord[0,i+n_keys]))
[25]:
describe_instance(X, explanation)
Original instance: <=50K  proba: [0.8058248 0.19417518]
Counterfactual instance: >50K  proba: [0.45429865 0.5457013 ]
Counterfactual perturbations...
Categorical:
Education: Associates > Masters
Numerical:
Capital Gain: 1.00 > 0.97
By obtaining a higher level of education and increasing capital gain slightly the income is predicted to be above $50k.
Change the categorical distance metric¶
Instead of abdm
, we now use mvdm
as our distance metric.
[26]:
set_seed()
cf.fit(X_train, d_type='mvdm')
explanation = cf.explain(X)
describe_instance(X, explanation)
Original instance: <=50K  proba: [0.8058248 0.19417518]
Counterfactual instance: >50K  proba: [0.34000358 0.65999645]
Counterfactual perturbations...
Categorical:
Education: Associates > Masters
Numerical:
Capital Gain: 1.00 > 0.93
The same conclusion (increase the level of education and capital gain) can be drawn with the different distance metric.
Use kd trees to build prototypes¶
We can also use kd trees to build class prototypes to guide the counterfactual to nearby instances in the counterfactual class as described in Interpretable Counterfactual Explanations Guided by Prototypes.
[27]:
use_kdtree = True
theta = 10. # weight of prototype loss term
Initialize, fit and explain instance:
[28]:
set_seed()
X = X_test[7].reshape((1,) + X_test[0].shape)
cf = CounterFactualProto(nn,
shape,
beta=beta,
theta=theta,
cat_vars=cat_vars_ohe,
ohe=True,
use_kdtree=use_kdtree,
max_iterations=max_iterations,
feature_range=feature_range,
c_init=c_init,
c_steps=c_steps
)
cf.fit(X_train, d_type='abdm')
explanation = cf.explain(X)
describe_instance(X, explanation)
No encoder specified. Using kd trees to represent class prototypes.
Original instance: <=50K  proba: [0.5573588 0.4426412]
Counterfactual instance: >50K  proba: [0.49765858 0.50234145]
Counterfactual perturbations...
Categorical:
Numerical:
Age: 0.53 > 0.49
By slightly increasing the age of the person the income would be predicted to be above $50k.
Use an autoencoder to build prototypes¶
Another option is to use an autoencoder to guide the perturbed instance to the counterfactual class. We define and train the autoencoder:
[29]:
def ae_model():
# encoder
x_in = Input(shape=(57,))
x = Dense(60, activation='relu')(x_in)
x = Dense(30, activation='relu')(x)
x = Dense(15, activation='relu')(x)
encoded = Dense(10, activation=None)(x)
encoder = Model(x_in, encoded)
# decoder
dec_in = Input(shape=(10,))
x = Dense(15, activation='relu')(dec_in)
x = Dense(30, activation='relu')(x)
x = Dense(60, activation='relu')(x)
decoded = Dense(57, activation=None)(x)
decoder = Model(dec_in, decoded)
# autoencoder = encoder + decoder
x_out = decoder(encoder(x_in))
autoencoder = Model(x_in, x_out)
autoencoder.compile(optimizer='adam', loss='mse')
return autoencoder, encoder, decoder
[30]:
set_seed()
ae, enc, dec = ae_model()
ae.summary()
ae.fit(X_train, X_train, batch_size=128, epochs=100, validation_data=(X_test, X_test), verbose=0)
Model: "model_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 57)] 0
_________________________________________________________________
model_1 (Model) (None, 10) 5935
_________________________________________________________________
model_2 (Model) (None, 57) 5982
=================================================================
Total params: 11,917
Trainable params: 11,917
Nontrainable params: 0
_________________________________________________________________
[30]:
<tensorflow.python.keras.callbacks.History at 0x7fa0bc3e47f0>
Weights for the autoencoder and prototype loss terms:
[31]:
beta = .1 # L1
gamma = 10. # autoencoder
theta = .1 # prototype
Initialize, fit and explain instance:
[32]:
set_seed()
X = X_test[19].reshape((1,) + X_test[0].shape)
cf = CounterFactualProto(nn,
shape,
beta=beta,
enc_model=enc,
ae_model=ae,
gamma=gamma,
theta=theta,
cat_vars=cat_vars_ohe,
ohe=True,
max_iterations=max_iterations,
feature_range=feature_range,
c_init=c_init,
c_steps=c_steps
)
cf.fit(X_train, d_type='abdm')
explanation = cf.explain(X)
describe_instance(X, explanation)
Original instance: >50K  proba: [0.49551004 0.5044899 ]
Counterfactual instance: <=50K  proba: [0.67236006 0.32763994]
Counterfactual perturbations...
Categorical:
Occupation: Other > BlueCollar
Numerical:
Black box model with kd trees¶
Now we assume that we only have access to the model’s prediction function and treat it as a black box. The kd trees are again used to define the prototypes.
[33]:
use_kdtree = True
theta = 10. # weight of prototype loss term
Initialize, fit and explain instance:
[34]:
set_seed()
X = X_test[24].reshape((1,) + X_test[0].shape)
# define predict function
predict_fn = lambda x: nn.predict(x)
cf = CounterFactualProto(predict_fn,
shape,
beta=beta,
theta=theta,
cat_vars=cat_vars_ohe,
ohe=True,
use_kdtree=use_kdtree,
max_iterations=max_iterations,
feature_range=feature_range,
c_init=c_init,
c_steps=c_steps
)
cf.fit(X_train, d_type='abdm')
explanation = cf.explain(X)
describe_instance(X, explanation)
No encoder specified. Using kd trees to represent class prototypes.
Original instance: >50K  proba: [0.23185147 0.7681486 ]
Counterfactual instance: <=50K  proba: [0.5012684 0.4987316]
Counterfactual perturbations...
Categorical:
Numerical:
Age: 0.15 > 0.37
Hours per week: 0.20 > 0.25
If the person was younger and worked less, he or she would have a predicted income below $50k.
This page was generated from examples/cfproto_cat_adult_ord.ipynb.
Counterfactual explanations with ordinally encoded categorical variables¶
This example notebook illustrates how to obtain counterfactual explanations for instances with a mixture of ordinally encoded categorical and numerical variables. A more elaborate notebook highlighting additional functionality can be found here. We generate counterfactuals for instances in the adult dataset where we predict whether a person’s income is above or below $50k.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input, Embedding, Concatenate, Reshape, Dropout, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn.preprocessing import OneHotEncoder
from time import time
from alibi.datasets import fetch_adult
from alibi.explainers import CounterFactualProto
Load adult dataset¶
The fetch_adult
function returns a Bunch
object containing the features, the targets, the feature names and a mapping of the categories in each categorical variable.
[2]:
adult = fetch_adult()
data = adult.data
target = adult.target
feature_names = adult.feature_names
category_map_tmp = adult.category_map
target_names = adult.target_names
Define shuffled training and test set:
[3]:
def set_seed(s=0):
np.random.seed(s)
tf.set_random_seed(s)
[4]:
set_seed()
data_perm = np.random.permutation(np.c_[data, target])
X = data_perm[:,:1]
y = data_perm[:,1]
[5]:
idx = 30000
y_train, y_test = y[:idx], y[idx+1:]
Reorganize data so categorical features come first:
[6]:
X = np.c_[X[:, 1:8], X[:, 11], X[:, 0], X[:, 8:11]]
Adjust feature_names
and category_map
as well:
[7]:
feature_names = feature_names[1:8] + feature_names[11:12] + feature_names[0:1] + feature_names[8:11]
print(feature_names)
['Workclass', 'Education', 'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Country', 'Age', 'Capital Gain', 'Capital Loss', 'Hours per week']
[8]:
category_map = {}
for i, (_, v) in enumerate(category_map_tmp.items()):
category_map[i] = v
Create a dictionary with as keys the categorical columns and values the number of categories for each variable in the dataset. This dictionary will later be used in the counterfactual explanation.
[9]:
cat_vars_ord = {}
n_categories = len(list(category_map.keys()))
for i in range(n_categories):
cat_vars_ord[i] = len(np.unique(X[:, i]))
print(cat_vars_ord)
{0: 9, 1: 7, 2: 4, 3: 9, 4: 6, 5: 5, 6: 2, 7: 11}
Preprocess data¶
Scale numerical features between 1 and 1:
[10]:
X_num = X[:, 4:].astype(np.float32, copy=False)
xmin, xmax = X_num.min(axis=0), X_num.max(axis=0)
rng = (1., 1.)
X_num_scaled = (X_num  xmin) / (xmax  xmin) * (rng[1]  rng[0]) + rng[0]
X_num_scaled_train = X_num_scaled[:idx, :]
X_num_scaled_test = X_num_scaled[idx+1:, :]
Combine numerical and categorical data:
[11]:
X = np.c_[X[:, :4], X_num_scaled].astype(np.float32, copy=False)
X_train, X_test = X[:idx, :], X[idx+1:, :]
print(X_train.shape, X_test.shape)
(30000, 12) (2560, 12)
Train a neural net¶
The neural net will use entity embeddings for the categorical variables.
[12]:
def nn_ord():
x_in = Input(shape=(12,))
layers_in = []
# embedding layers
for i, (_, v) in enumerate(cat_vars_ord.items()):
emb_in = Lambda(lambda x: x[:, i:i+1])(x_in)
emb_dim = int(max(min(np.ceil(.5 * v), 50), 2))
emb_layer = Embedding(input_dim=v+1, output_dim=emb_dim, input_length=1)(emb_in)
emb_layer = Reshape(target_shape=(emb_dim,))(emb_layer)
layers_in.append(emb_layer)
# numerical layers
num_in = Lambda(lambda x: x[:, 4:])(x_in)
num_layer = Dense(16)(num_in)
layers_in.append(num_layer)
# combine
x = Concatenate()(layers_in)
x = Dense(60, activation='relu')(x)
x = Dropout(.2)(x)
x = Dense(60, activation='relu')(x)
x = Dropout(.2)(x)
x = Dense(60, activation='relu')(x)
x = Dropout(.2)(x)
x_out = Dense(2, activation='softmax')(x)
nn = Model(inputs=x_in, outputs=x_out)
nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return nn
[13]:
set_seed()
nn = nn_ord()
nn.summary()
nn.fit(X_train, to_categorical(y_train), batch_size=128, epochs=20, verbose=0)
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 12)] 0
__________________________________________________________________________________________________
lambda (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_1 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_2 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_3 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_4 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_5 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_6 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_7 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
embedding (Embedding) (None, 1, 5) 50 lambda[0][0]
__________________________________________________________________________________________________
embedding_1 (Embedding) (None, 1, 4) 32 lambda_1[0][0]
__________________________________________________________________________________________________
embedding_2 (Embedding) (None, 1, 2) 10 lambda_2[0][0]
__________________________________________________________________________________________________
embedding_3 (Embedding) (None, 1, 5) 50 lambda_3[0][0]
__________________________________________________________________________________________________
embedding_4 (Embedding) (None, 1, 3) 21 lambda_4[0][0]
__________________________________________________________________________________________________
embedding_5 (Embedding) (None, 1, 3) 18 lambda_5[0][0]
__________________________________________________________________________________________________
embedding_6 (Embedding) (None, 1, 2) 6 lambda_6[0][0]
__________________________________________________________________________________________________
embedding_7 (Embedding) (None, 1, 6) 72 lambda_7[0][0]
__________________________________________________________________________________________________
lambda_8 (Lambda) (None, 4) 0 input_1[0][0]
__________________________________________________________________________________________________
reshape (Reshape) (None, 5) 0 embedding[0][0]
__________________________________________________________________________________________________
reshape_1 (Reshape) (None, 4) 0 embedding_1[0][0]
__________________________________________________________________________________________________
reshape_2 (Reshape) (None, 2) 0 embedding_2[0][0]
__________________________________________________________________________________________________
reshape_3 (Reshape) (None, 5) 0 embedding_3[0][0]
__________________________________________________________________________________________________
reshape_4 (Reshape) (None, 3) 0 embedding_4[0][0]
__________________________________________________________________________________________________
reshape_5 (Reshape) (None, 3) 0 embedding_5[0][0]
__________________________________________________________________________________________________
reshape_6 (Reshape) (None, 2) 0 embedding_6[0][0]
__________________________________________________________________________________________________
reshape_7 (Reshape) (None, 6) 0 embedding_7[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 16) 80 lambda_8[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate) (None, 46) 0 reshape[0][0]
reshape_1[0][0]
reshape_2[0][0]
reshape_3[0][0]
reshape_4[0][0]
reshape_5[0][0]
reshape_6[0][0]
reshape_7[0][0]
dense[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 60) 2820 concatenate[0][0]
__________________________________________________________________________________________________
dropout (Dropout) (None, 60) 0 dense_1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 60) 3660 dropout[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout) (None, 60) 0 dense_2[0][0]
__________________________________________________________________________________________________
dense_3 (Dense) (None, 60) 3660 dropout_1[0][0]
__________________________________________________________________________________________________
dropout_2 (Dropout) (None, 60) 0 dense_3[0][0]
__________________________________________________________________________________________________
dense_4 (Dense) (None, 2) 122 dropout_2[0][0]
==================================================================================================
Total params: 10,601
Trainable params: 10,601
Nontrainable params: 0
__________________________________________________________________________________________________
[13]:
<tensorflow.python.keras.callbacks.History at 0x7fd4e848cda0>
Generate counterfactual¶
Original instance:
[14]:
X = X_test[0].reshape((1,) + X_test[0].shape)
Initialize counterfactual parameters:
[15]:
shape = X.shape
beta = .01
c_init = 1.
c_steps = 5
max_iterations = 500
rng = (1., 1.) # scale features between 1 and 1
rng_shape = (1,) + data.shape[1:]
feature_range = ((np.ones(rng_shape) * rng[0]).astype(np.float32),
(np.ones(rng_shape) * rng[1]).astype(np.float32))
Initialize explainer. Since the Embedding
layers in tf.keras
do not let gradients propagate through, we will only make use of the model’s predict function, treat it as a black box and perform numerical gradient calculations.
[16]:
set_seed()
# define predict function
predict_fn = lambda x: nn.predict(x)
cf = CounterFactualProto(predict_fn,
shape,
beta=beta,
cat_vars=cat_vars_ord,
max_iterations=max_iterations,
feature_range=feature_range,
c_init=c_init,
c_steps=c_steps,
eps=(.01, .01) # perturbation size for numerical gradients
)
Fit explainer. Please check the documentation for more info about the optional arguments.
[17]:
cf.fit(X_train, d_type='abdm', disc_perc=[25, 50, 75])
Explain instance:
[18]:
set_seed()
explanation = cf.explain(X)
Helper function to more clearly describe explanations:
[19]:
def describe_instance(X, explanation, eps=1e2):
print('Original instance: {}  proba: {}'.format(target_names[explanation['orig_class']],
explanation['orig_proba'][0]))
print('Counterfactual instance: {}  proba: {}'.format(target_names[explanation['cf']['class']],
explanation['cf']['proba'][0]))
print('\nCounterfactual perturbations...')
print('\nCategorical:')
X_orig_ord = X
X_cf_ord = explanation['cf']['X']
delta_cat = {}
for i, (_, v) in enumerate(category_map.items()):
cat_orig = v[int(X_orig_ord[0, i])]
cat_cf = v[int(X_cf_ord[0, i])]
if cat_orig != cat_cf:
delta_cat[feature_names[i]] = [cat_orig, cat_cf]
if delta_cat:
for k, v in delta_cat.items():
print('{}: {} > {}'.format(k, v[0], v[1]))
print('\nNumerical:')
delta_num = X_cf_ord[0, 4:]  X_orig_ord[0, 4:]
n_keys = len(list(cat_vars_ord.keys()))
for i in range(delta_num.shape[0]):
if np.abs(delta_num[i]) > eps:
print('{}: {:.2f} > {:.2f}'.format(feature_names[i+n_keys],
X_orig_ord[0,i+n_keys],
X_cf_ord[0,i+n_keys]))
[20]:
describe_instance(X, explanation)
Original instance: <=50K  proba: [0.8347201 0.16527994]
Counterfactual instance: >50K  proba: [0.47529536 0.52470464]
Counterfactual perturbations...
Categorical:
Occupation: WhiteCollar > Professional
Numerical:
Capital Gain: 1.00 > 0.87
The person’s incomce is predicted to be above $50k by increasing his or her capital gain and changing occupation to Professional.
This page was generated from examples/linearity_measure_iris.ipynb.
Linearity measure applied to Iris¶
General definition¶
The model linearity module in alibi provides metric to measure how linear an ML model is. Linearity is defined based on how much the linear superposition of the model’s outputs differs from the output of the same linear superposition of the inputs.
Given \(N\) input vectors \(v_i\), \(N\) real coefficients \(\alpha_i\) and a predict function \(\text{M}(v_i)\), the linearity of the predict function is defined as
Note that a lower value of \(L\) means that the model \(M\) is more linear.
Alibi implementation¶
Based on the general definition above, alibi calculates the linearity of a model in the neighboorhood of a given instance \(v_0\).
Iris Data set¶
As an example, we will visualize the decision boundaries and the values of the linearity measure for various classifier on the iris dataset. Only 2 features are included for visualization porpuses.
[2]:
import pandas as pd
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from itertools import product
from alibi.confidence import linearity_measure, LinearityMeasure
Models¶
We will experiment with 5 different classifiers: * A logistic regression model, which is expected to be highly linear. * A random forest classifier, which is expected to be higly nonlinear. * An xgboost classifier. * A support vector machine classifier. * A feed forward neural network
[5]:
lr = LogisticRegression(fit_intercept=False, multi_class='multinomial', solver='newtoncg')
rf = RandomForestClassifier(n_estimators=100)
xgb = XGBClassifier(n_estimators=100)
svm = SVC(gamma=.1, kernel='rbf', probability=True)
nn = MLPClassifier(hidden_layer_sizes=(100,50), activation='relu', max_iter=1000)
[6]:
lr.fit(X_train, y_train)
rf.fit(X_train, y_train)
xgb.fit(X_train, y_train)
svm.fit(X_train, y_train)
nn.fit(X_train, y_train);
Decision boundaries and linearity¶
[7]:
# Creating a grid
x_min, x_max = X_train[:, 0].min()  1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min()  1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
[8]:
# Flattening points in the grid
X = np.empty((len(xx.flatten()), 2))
for i in range(xx.shape[0]):
for j in range(xx.shape[1]):
k = i * xx.shape[1] + j
X[k] = np.array([xx[i, j], yy[i, j]])
Logistic regression¶
[9]:
# Defining predict function for logistic regression
clf = lr
predict_fn = lambda x: clf.predict_proba(x)
[10]:
# Calculating linearity for all points in the grid
lm = LinearityMeasure(agg='pairwise')
lm.fit(X_train)
L = lm.score(predict_fn, X)
L = L.reshape(xx.shape)
lins_dict['LR'] = L.mean()
[11]:
# Visualising decision boundaries and linearity values
f, axarr = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16, 8))
idx = (0,0)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axarr[0].contourf(xx, yy, Z, alpha=0.4)
axarr[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=20, edgecolor='k', alpha=1)
axarr[0].set_title('Decision boundaries', fontsize=20)
axarr[0].set_xlabel('sepal length (cm)',fontsize=18)
axarr[0].set_ylabel('sepal width (cm)', fontsize=18)
LPL = axarr[1].contourf(xx, yy, L, alpha=0.8, cmap='Greys')
axarr[1].set_title('Model linearity', fontsize=20)
axarr[1].set_xlabel('sepal length (cm)', fontsize=18)
axarr[1].set_ylabel('sepal width (cm)', fontsize=18)
cbar = f.colorbar(LPL)
#cbar.ax.set_ylabel('Linearity')
plt.show()
print('Decision boundaries (left panel) and linearity measure (right panel) for a logistic regression (LG) classifier in feature space. The x and y axis in the plots represent the sepal length and the sepal width, respectively. Different colours correspond to different predicted classes. The markers represents the data points in the training set.')
print('Maximum value model linearity: {}'. format(np.round(L.max(), 5)))
print('Minimum value model linearity: {}'.format(np.round(L.min(),5)))
Decision boundaries (left panel) and linearity measure (right panel) for a logistic regression (LG) classifier in feature space. The x and y axis in the plots represent the sepal length and the sepal width, respectively. Different colours correspond to different predicted classes. The markers represents the data points in the training set.
Maximum value model linearity: 0.01841
Minimum value model linearity: 0.0
Random forest¶
[12]:
# Defining predict function for random forest
clf = rf
predict_fn = lambda x: clf.predict_proba(x)
[13]:
# Calculating linearity for all points in the grid
lm = LinearityMeasure(agg='pairwise')
lm.fit(X_train)
L = lm.score(predict_fn, X)
L = L.reshape(xx.shape)
lins_dict['RF'] = L.mean()
[14]:
# Visualising decision boundaries and linearity values
f, axarr = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16, 8))
idx = (0,0)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axarr[0].contourf(xx, yy, Z, alpha=0.4)
axarr[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=20, edgecolor='k', alpha=1)
axarr[0].set_title('Decision boundaries', fontsize=20)
axarr[0].set_xlabel('sepal length (cm)', fontsize=18)
axarr[0].set_ylabel('sepal width (cm)', fontsize=18)
LPL = axarr[1].contourf(xx, yy, L, alpha=0.8, cmap='Greys')
axarr[1].set_title('Model linearity', fontsize=20)
axarr[1].set_xlabel('sepal length (cm)', fontsize=18)
axarr[1].set_ylabel('sepal width (cm)', fontsize=18)
cbar = f.colorbar(LPL)
plt.show()
print('Decision boundaries (left panel) and linearity measure (right panel) for a random forest (RF) classifier in feature space. The x and y axis in the plots represent the sepal length and the sepal width, respectively. Different colours correspond to different predicted classes. The markers represents the data points in the training set.')
print('Maximum value model linearity: {}'. format(np.round(L.max(), 5)))
print('Minimum value model linearity: {}'.format(np.round(L.min(),5)))
Decision boundaries (left panel) and linearity measure (right panel) for a random forest (RF) classifier in feature space. The x and y axis in the plots represent the sepal length and the sepal width, respectively. Different colours correspond to different predicted classes. The markers represents the data points in the training set.
Maximum value model linearity: 12.07288
Minimum value model linearity: 0.0
Xgboost¶
[15]:
# Defining predict function for xgboost
clf = xgb
predict_fn = lambda x: clf.predict_proba(x)
[16]:
# Calculating linearity for all points in the grid
lm = LinearityMeasure(agg='pairwise')
lm.fit(X_train)
L = lm.score(predict_fn, X)
L = L.reshape(xx.shape)
lins_dict['XB'] = L.mean()
[17]:
# Visualising decision boundaries and linearity values
f, axarr = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16, 8))
idx = (0,0)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axarr[0].contourf(xx, yy, Z, alpha=0.4)
axarr[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=20, edgecolor='k', alpha=1)
axarr[0].set_title('Decision boundaries', fontsize=20)
axarr[0].set_xlabel('sepal length (cm)', fontsize=20)
axarr[0].set_ylabel('sepal width (cm)', fontsize=20)
LPL = axarr[1].contourf(xx, yy, L, alpha=0.8, cmap='Greys')
axarr[1].set_title('L measure', fontsize=20)
axarr[1].set_xlabel('sepal length (cm)', fontsize=20)
axarr[1].set_ylabel('sepal width (cm)', fontsize=20)
cbar = f.colorbar(LPL)
#cbar.ax.set_ylabel('Linearity')
plt.show()
print('Decision boundaries (left panel) and linearity measure (right panel) for a xgboost (XB) classifier in feature space. The x and y axis in the plots represent the sepal length and the sepal width, respectively. Different colours correspond to different predicted classes. The markers represents the data points in the training set.')
print('Maximum value model linearity: {}'. format(np.round(L.max(), 5)))
print('Minimum value model linearity: {}'.format(np.round(L.min(),5)))
Decision boundaries (left panel) and linearity measure (right panel) for a xgboost (XB) classifier in feature space. The x and y axis in the plots represent the sepal length and the sepal width, respectively. Different colours correspond to different predicted classes. The markers represents the data points in the training set.
Maximum value model linearity: 1.42648
Minimum value model linearity: 0.0
SVM¶
[18]:
# Defining predict function for svm
clf = svm
predict_fn = lambda x: clf.predict_proba(x)
[19]:
# Calculating linearity for all points in the grid
lm = LinearityMeasure(agg='pairwise')
lm.fit(X_train)
L = lm.score(predict_fn, X)
L = L.reshape(xx.shape)
lins_dict['SM'] = L.mean()
[20]:
# Visualising decision boundaries and linearity values
f, axarr = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16, 8))
idx = (0,0)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axarr[0].contourf(xx, yy, Z, alpha=0.4)
axarr[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=20, edgecolor='k', alpha=1)
axarr[0].set_title('Decision boundaries', fontsize=20)
axarr[0].set_xlabel('sepal length (cm)', fontsize=18)
axarr[0].set_ylabel('sepal width (cm)', fontsize=18)
LPL = axarr[1].contourf(xx, yy, L, alpha=0.8, cmap='Greys')
axarr[1].set_title('Model linearity', fontsize=20)
axarr[1].set_xlabel('sepal length (cm)', fontsize=18)
axarr[1].set_ylabel('sepal width (cm)', fontsize=18)
cbar = f.colorbar(LPL)
#cbar.ax.set_ylabel('Linearity')
plt.show()
print('Decision boundaries (left panel) and linearity measure (right panel) for a support vector machine (SM) classifier in feature space. The x and y axis in the plots represent the sepal length and the sepal width, respectively. Different colours correspond to different predicted classes. The markers represents the data points in the training set.')
print('Maximum value model linearity: {}'. format(np.round(L.max(), 5)))
print('Minimum value model linearity: {}'.format(np.round(L.min(),5)))
Decision boundaries (left panel) and linearity measure (right panel) for a support vector machine (SM) classifier in feature space. The x and y axis in the plots represent the sepal length and the sepal width, respectively. Different colours correspond to different predicted classes. The markers represents the data points in the training set.
Maximum value model linearity: 0.45113
Minimum value model linearity: 0.00083
NN¶
[21]:
# Defining predict function for svm
clf = nn
predict_fn = lambda x: clf.predict_proba(x)
[22]:
# Calculating linearity for all points in the grid
lm = LinearityMeasure(agg='pairwise')
lm.fit(X_train)
L = lm.score(predict_fn, X)
L = L.reshape(xx.shape)
lins_dict['NN'] = L.mean()
[23]:
# Visualising decision boundaries and linearity values
f, axarr = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16, 8))
idx = (0,0)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axarr[0].contourf(xx, yy, Z, alpha=0.4)
axarr[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=20, edgecolor='k', alpha=1)
axarr[0].set_title('Decision boundaries', fontsize=20)
axarr[0].set_xlabel('sepal length (cm)', fontsize=18)
axarr[0].set_ylabel('sepal width (cm)', fontsize=18)
LPL = axarr[1].contourf(xx, yy, L, alpha=0.8, cmap='Greys')
axarr[1].set_title('Model linearity', fontsize=20)
axarr[1].set_xlabel('sepal length (cm)', fontsize=18)
axarr[1].set_ylabel('sepal width (cm)', fontsize=18)
cbar = f.colorbar(LPL)
#cbar.ax.set_ylabel('Linearity')
plt.show()
print('Decision boundaries (left panel) and linearity measure (right panel) for a feed forward neural network classifier (NN) classifier in feature space. The x and y axis in the plots represent the sepal length and the sepal width, respectively. Different colours correspond to different predicted classes. The markers represents the data points in the training set.')
print('Maximum value model linearity: {}'. format(np.round(L.max(), 5)))
print('Minimum value model linearity: {}'.format(np.round(L.min(),5)))
Decision boundaries (left panel) and linearity measure (right panel) for a feed forward neural network classifier (NN) classifier in feature space. The x and y axis in the plots represent the sepal length and the sepal width, respectively. Different colours correspond to different predicted classes. The markers represents the data points in the training set.
Maximum value model linearity: 0.11615
Minimum value model linearity: 3e05
Average linearity over the whole feature space¶
[24]:
ax = pd.Series(data=lins_dict).sort_values().plot(kind='barh', figsize=(10,5), fontsize=20, color='dimgray',
width=0.8, logx=True)
ax.set_xlabel('L measure (log scale)', fontsize=20)
print('Comparison of the linearity measure L averaged over the whole feature space for various models trained on the iris dataset: random forest (RF), xgboost (XB), support vector machine (SM), neural network (NN) and logistic regression (LR). Note that the scale of the X axis is logarithmic.')
Comparison of the linearity measure L averaged over the whole feature space for various models trained on the iris dataset: random forest (RF), xgboost (XB), support vector machine (SM), neural network (NN) and logistic regression (LR). Note that the scale of the X axis is logarithmic.
This page was generated from examples/linearity_measure_fashion_mnist.ipynb.
Linearity measure applied to fashion MNIST¶
General definition¶
The model linearity module in alibi provides metric to measure how linear an ML model is. Linearity is defined based on how much the linear superposition of the model’s outputs differs from the output of the same linear superposition of the inputs.
Given \(N\) input vectors \(v_i\), \(N\) real coefficients \(\alpha_i\) and a predict function \(\text{M}(v_i)\), the linearity of the predict function is defined as
Note that a lower value of \(L\) means that the model \(M\) is more linear.
Alibi implementation¶
Based on the general definition above, alibi calculates the linearity of a model in the neighboorhood of a given instance \(v_0\).
Fashion MNIST data set¶
We train a convolutional neural network to classify the images in the fashion MNIST dataset.
We investigate the correlation between the model’s linearity associated to a certain instance and the class the instance belong to.
We also calculate the linearity measure for each internal layer of the CNN and show how linearity propagates through the model.
[1]:
import pandas as pd
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
from time import time
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from alibi.confidence.model_linearity import linearity_measure, LinearityMeasure
from alibi.confidence.model_linearity import _infer_feature_range
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Input, Activation
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import backend as K
Load data fashion mnist¶
The fashion MNIST data set consists of 60000 images of shape \(28 \times 28\) divided in 10 categories. Each category corresponds to a different type of clothing piece, such as “boots”, “tshirts”, etc
[3]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
x_train shape: (60000, 28, 28) y_train shape: (60000,)
[4]:
idx = 0
plt.imshow(x_train[idx])
print('Sample instance from the MNIST data set.')
Sample instance from the MNIST data set.
[5]:
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = np.reshape(x_train, x_train.shape + (1,))
x_test = np.reshape(x_test, x_test.shape + (1,))
print('x_train shape:', x_train.shape, 'x_test shape:', x_test.shape)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print('y_train shape:', y_train.shape, 'y_test shape:', y_test.shape)
x_train shape: (60000, 28, 28, 1) x_test shape: (10000, 28, 28, 1)
y_train shape: (60000, 10) y_test shape: (10000, 10)
Convolutional neural network¶
Here we define and train a 2 layer convolutional neural network on the fashion MNIST data set.
Define model¶
[6]:
def model():
x_in = Input(shape=(28, 28, 1), name='input')
x = Conv2D(filters=64, kernel_size=2, padding='same', name='conv_1')(x_in)
x = Activation('relu', name='relu_1')(x)
x = MaxPooling2D(pool_size=2, name='maxp_1')(x)
x = Dropout(0.3, name='drop_1')(x)
x = Conv2D(filters=64, kernel_size=2, padding='same', name='conv_2')(x)
x = Activation('relu', name='relu_2')(x)
x = MaxPooling2D(pool_size=2, name='maxp_2')(x)
x = Dropout(0.3, name='drop_2')(x)
x = Flatten(name='flat')(x)
x = Dense(256, name='dense_1')(x)
x = Activation('relu', name='relu_3')(x)
x = Dropout(0.5, name='drop_3')(x)
x_out = Dense(10, name='dense_2')(x)
x_out = Activation('softmax', name='softmax')(x_out)
cnn = Model(inputs=x_in, outputs=x_out)
cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return cnn
[7]:
cnn = model()
cnn.summary()
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
conv_1 (Conv2D) (None, 28, 28, 64) 320
_________________________________________________________________
relu_1 (Activation) (None, 28, 28, 64) 0
_________________________________________________________________
maxp_1 (MaxPooling2D) (None, 14, 14, 64) 0
_________________________________________________________________
drop_1 (Dropout) (None, 14, 14, 64) 0
_________________________________________________________________
conv_2 (Conv2D) (None, 14, 14, 64) 16448
_________________________________________________________________
relu_2 (Activation) (None, 14, 14, 64) 0
_________________________________________________________________
maxp_2 (MaxPooling2D) (None, 7, 7, 64) 0
_________________________________________________________________
drop_2 (Dropout) (None, 7, 7, 64) 0
_________________________________________________________________
flat (Flatten) (None, 3136) 0
_________________________________________________________________
dense_1 (Dense) (None, 256) 803072
_________________________________________________________________
relu_3 (Activation) (None, 256) 0
_________________________________________________________________
drop_3 (Dropout) (None, 256) 0
_________________________________________________________________
dense_2 (Dense) (None, 10) 2570
_________________________________________________________________
softmax (Activation) (None, 10) 0
=================================================================
Total params: 822,410
Trainable params: 822,410
Nontrainable params: 0
_________________________________________________________________
Training¶
[8]:
cnn.fit(x_train, y_train, batch_size=64, epochs=5);
Epoch 1/5
60000/60000 [==============================]  40s 674us/sample  loss: 0.5552  acc: 0.7955
Epoch 2/5
60000/60000 [==============================]  43s 717us/sample  loss: 0.3865  acc: 0.8596
Epoch 3/5
60000/60000 [==============================]  51s 852us/sample  loss: 0.3421  acc: 0.8765
Epoch 4/5
60000/60000 [==============================]  47s 782us/sample  loss: 0.3123  acc: 0.8851
Epoch 5/5
60000/60000 [==============================]  48s 802us/sample  loss: 0.2938  acc: 0.8936
Linearity of each Layer¶
Here we calculate the linearity of the model considering each layer as the output in turn. The values are averaged over 100 random instances sampled from the training set.
Extract layers¶
[9]:
inp = cnn.input
outs = {l.name: l.output for l in cnn.layers}
predict_fns = {name: K.function([inp], [out]) for name, out in outs.items()}
Calculate linearity¶
[10]:
# Infering feature ranges.
features_range = _infer_feature_range(x_test)
# Selecting random instances from training set.
rnd = np.random.randint(len(x_test)  101, size=100)
[11]:
lins_layers = {}
for name, l in predict_fns.items():
if name != 'input':
def predict_fn(x):
layer = l([x])
return layer[0]
if name == 'softmax':
lins_layers[name] = linearity_measure(predict_fn, x_test[rnd], feature_range=features_range,
agg='global', model_type='classifier', nb_samples=20)
else:
lins_layers[name] = linearity_measure(predict_fn, x_test[rnd], feature_range=features_range,
agg='global', model_type='regressor', nb_samples=20)
lins_layers_mean = {k: v.mean() for k, v in lins_layers.items()}
S = pd.Series(data=lins_layers_mean)
[12]:
colors = ['gray' for l in S[:1]]
colors.append('r')
ax = S.plot(kind='bar', linewidth=3, figsize=(15,10), color=colors, width=0.7, fontsize=18)
ax.set_ylabel('L measure', fontsize=20)
ax.set_xlabel('Layer', fontsize=20)
print('Linearity measure calculated taking as output each layer of a convolutional neural network.')
Linearity measure calculated taking as output each layer of a convolutional neural network.
Linearity measure in the locality of a given instance calculated taking as output each layer of a convolutional neural network trained on the fashion MNIST data set. * The linearity measure of the first convolutional layer conv_1 is 0, as expected since convolutions are linear operations. * The relu activation introduces nonlinearity, which is increased by maxpooling. Dropout layers and flatten layers do no change the output at inference time so the linearity doesn’t change. * The second convolutional layer conv_2 and the dense layers change the linearity even though they are linear operations. * The softmax layer in red is obtained by inverting the softmax function. * For more details see arxiv reference.
Linearity and categories¶
Here we calculate the linearity averaged over all instances belonging to the same class, for each class.
[13]:
class_groups = []
for i in range(10):
y = y_test.argmax(axis=1)
idxs_i = np.where(y == i)[0]
class_groups.append(x_test[idxs_i])
[14]:
def predict_fn(x):
return cnn.predict(x)
lins_classes = []
t_0 = time()
for j in range(len(class_groups)):
print('Calculating linearity for instances belonging to class {}'.format(j))
class_group = class_groups[j]
class_group = np.random.permutation(class_group)[:2000]
t_i = time()
lin = linearity_measure(predict_fn, class_group, feature_range=features_range,
agg='global', model_type='classifier', nb_samples=20)
t_i_1 = time()  t_i
print('Run time for class {}: {}'.format(j, t_i_1))
lins_classes.append(lin)
t_fin = time()  t_0
print('Total run time: {}'.format(t_fin))
Calculating linearity for instances belonging to class 0
Run time for class 0: 2.941605806350708
Calculating linearity for instances belonging to class 1
Run time for class 1: 3.3313376903533936
Calculating linearity for instances belonging to class 2
Run time for class 2: 3.178601026535034
Calculating linearity for instances belonging to class 3
Run time for class 3: 3.324582815170288
Calculating linearity for instances belonging to class 4
Run time for class 4: 3.085338830947876
Calculating linearity for instances belonging to class 5
Run time for class 5: 3.159513473510742
Calculating linearity for instances belonging to class 6
Run time for class 6: 3.4014275074005127
Calculating linearity for instances belonging to class 7
Run time for class 7: 3.3238165378570557
Calculating linearity for instances belonging to class 8
Run time for class 8: 2.9885218143463135
Calculating linearity for instances belonging to class 9
Run time for class 9: 3.4760279655456543
Total run time: 32.22387504577637
[15]:
df = pd.DataFrame(data=lins_classes).T
[16]:
ax = df.mean().plot(kind='bar', linewidth=3, figsize=(15,10), color='gray', width=0.7, fontsize=10)
ax.set_ylabel('L measure', fontsize=20)
ax.set_xlabel('Class', fontsize=20)
print("Linearity measure distribution means for each class in the fashion MNIST data set.")
Linearity measure distribution means for each class in the fashion MNIST data set.
[17]:
ax2 = df.plot(kind='hist', subplots=True, bins=20, figsize=(10,10), sharey=True)
for a in ax2:
a.set_xlabel('L measure', fontsize=20)
a.set_ylabel('', rotation=True, fontsize=10)
#ax2.set_ylabel('F', fontsize=10)
print('Linearity measure distributions for each class in the fashion MNIST data set.')
Linearity measure distributions for each class in the fashion MNIST data set.
This page was generated from examples/trustscore_iris.ipynb.
Trust Scores applied to Iris¶
It is important to know when a machine learning classifier’s predictions can be trusted. Relying on the classifier’s (uncalibrated) prediction probabilities is not optimal and can be improved upon. Trust scores measure the agreement between the classifier and a modified nearest neighbor classifier on the test set. The trust score is the ratio between the distance of the test instance to the nearest class different from the predicted class and the distance to the predicted class. Higher scores correspond to more trustworthy predictions. A score of 1 would mean that the distance to the predicted class is the same as to another class.
The original paper on which the algorithm is based is called To Trust Or Not To Trust A Classifier. Our implementation borrows heavily from https://github.com/google/TrustScore, as does the example notebook.
[1]:
import matplotlib
%matplotlib inline
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from alibi.confidence import TrustScore
Load and prepare Iris dataset¶
[2]:
dataset = load_iris()
Scale data
[3]:
dataset.data = (dataset.data  dataset.data.mean(axis=0)) / dataset.data.std(axis=0)
Define training and test set
[4]:
idx = 140
X_train,y_train = dataset.data[:idx,:], dataset.target[:idx]
X_test, y_test = dataset.data[idx+1:,:], dataset.target[idx+1:]
Fit model and make predictions¶
[5]:
np.random.seed(0)
clf = LogisticRegression(solver='liblinear', multi_class='auto')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Predicted class: {}'.format(y_pred))
Predicted class: [2 2 2 2 2 2 2 2 2]
Basic Trust Score Usage¶
Initialise Trust Scores and fit on training data¶
The trust score algorithm builds kd trees for each class. The distance of the test instance to the \(k\)th nearest neighbor of each tree (or the average distance to the \(k\)th neighbor) can then be used to calculate the trust score. We can optionally filter out outliers in the training data before building the trees. The example below uses the distance_knn (filter_type
) method to filter out the 5% (alpha
) instances of each
class with the highest distance to its 10th nearest neighbor (k_filter
) in that class.
[6]:
ts = TrustScore(k_filter=10, # nb of neighbors used for kNN distance or probability to filter out outliers
alpha=.05, # target fraction of instances to filter out
filter_type='distance_knn', # filter method: None, 'distance_knn' or 'probability_knn'
leaf_size=40, # affects speed and memory to build KDTrees, memory scales with n_samples / leaf_size
metric='euclidean', # distance metric used for the KDTrees
dist_filter_type='point') # 'point' uses distance to knearest point
# 'mean' uses average distance from the 1st to the kth nearest point
[7]:
ts.fit(X_train, y_train, classes=3) # classes = nb of prediction classes
Calculate Trust Scores on test data¶
Since the trust score is the ratio between the distance of the test instance to the nearest class different from the predicted class and the distance to the predicted class, higher scores correspond to more trustworthy predictions. A score of 1 would mean that the distance to the predicted class is the same as to another class. The score
method returns arrays with both the trust scores and the class labels of the closest not predicted class.
[8]:
score, closest_class = ts.score(X_test,
y_pred, k=2, # kth nearest neighbor used
# to compute distances for each class
dist_type='point') # 'point' or 'mean' distance option
print('Trust scores: {}'.format(score))
print('\nClosest not predicted class: {}'.format(closest_class))
Trust scores: [2.574271277538439 2.1630334957870114 3.1629405367742223
2.7258494544157927 2.541748027539072 1.402878283257114 1.941073062524019
2.0601725424359296 2.1781121494573514]
Closest not predicted class: [1 1 1 1 1 1 1 1 1]
Comparison of Trust Scores with model prediction probabilities¶
Let’s compare the prediction probabilities from the classifier with the trust scores for each prediction. The first use case checks whether trust scores are better than the model’s prediction probabilities at identifying correctly classified examples, while the second use case does the same for incorrectly classified instances.
First we need to set up a couple of helper functions.
Define a function that handles model training and predictions for a simple logistic regression:
[9]:
def run_lr(X_train, y_train, X_test):
clf = LogisticRegression(solver='liblinear', multi_class='auto')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)
probas = y_pred_proba[range(len(y_pred)), y_pred] # probabilities of predicted class
return y_pred, probas
Define the function that generates the precision plots:
[10]:
def plot_precision_curve(plot_title,
percentiles,
labels,
final_tp,
final_stderr,
final_misclassification,
colors = ['blue', 'darkorange', 'brown', 'red', 'purple']):
plt.title(plot_title, fontsize=18)
colors = colors + list(cm.rainbow(np.linspace(0, 1, len(final_tp))))
plt.xlabel("Percentile", fontsize=14)
plt.ylabel("Precision", fontsize=14)
for i, label in enumerate(labels):
ls = "" if ("Model" in label) else ""
plt.plot(percentiles, final_tp[i], ls, c=colors[i], label=label)
plt.fill_between(percentiles,
final_tp[i]  final_stderr[i],
final_tp[i] + final_stderr[i],
color=colors[i],
alpha=.1)
if 0. in percentiles:
plt.legend(loc="lower right", fontsize=14)
else:
plt.legend(loc="upper left", fontsize=14)
model_acc = 100 * (1  final_misclassification)
plt.axvline(x=model_acc, linestyle="dotted", color="black")
plt.show()
The function below trains the model on a number of folds, makes predictions, calculates the trust scores, and generates the precision curves to compare the trust scores with the model prediction probabilities:
[11]:
def run_precision_plt(X, y, nfolds, percentiles, run_model, test_size=.5,
plt_title="", plt_names=[], predict_correct=True, classes=3):
def stderr(L):
return np.std(L) / np.sqrt(len(L))
all_tp = [[[] for p in percentiles] for _ in plt_names]
misclassifications = []
mult = 1 if predict_correct else 1
folds = StratifiedShuffleSplit(n_splits=nfolds, test_size=test_size, random_state=0)
for train_idx, test_idx in folds.split(X, y):
# create train and test folds, train model and make predictions
X_train, y_train = X[train_idx, :], y[train_idx]
X_test, y_test = X[test_idx, :], y[test_idx]
y_pred, probas = run_lr(X_train, y_train, X_test)
# target points are the correctly classified points
target_points = np.where(y_pred == y_test)[0] if predict_correct else np.where(y_pred != y_test)[0]
final_curves = [probas]
# calculate trust scores
ts = TrustScore()
ts.fit(X_train, y_train, classes=classes)
scores, _ = ts.score(X_test, y_pred)
final_curves.append(scores) # contains prediction probabilities and trust scores
# check where prediction probabilities and trust scores are above a certain percentage level
for p, perc in enumerate(percentiles):
high_proba = [np.where(mult * curve >= np.percentile(mult * curve, perc))[0] for curve in final_curves]
if 0 in map(len, high_proba):
continue
# calculate fraction of values above percentage level that are correctly (or incorrectly) classified
tp = [len(np.intersect1d(hp, target_points)) / (1. * len(hp)) for hp in high_proba]
for i in range(len(plt_names)):
all_tp[i][p].append(tp[i]) # for each percentile, store fraction of values above cutoff value
misclassifications.append(len(target_points) / (1. * len(X_test)))
# average over folds for each percentile
final_tp = [[] for _ in plt_names]
final_stderr = [[] for _ in plt_names]
for p, perc in enumerate(percentiles):
for i in range(len(plt_names)):
final_tp[i].append(np.mean(all_tp[i][p]))
final_stderr[i].append(stderr(all_tp[i][p]))
for i in range(len(all_tp)):
final_tp[i] = np.array(final_tp[i])
final_stderr[i] = np.array(final_stderr[i])
final_misclassification = np.mean(misclassifications)
# create plot
plot_precision_curve(plt_title, percentiles, plt_names, final_tp, final_stderr, final_misclassification)
Detect correctly classified examples¶
The xaxis on the plot below shows the percentiles for the model prediction probabilities of the predicted class for each instance and for the trust scores. The yaxis represents the precision for each percentile. For each percentile level, we take the test examples whose trust score is above that percentile level and plot the percentage of those points that were correctly classified by the classifier. We do the same with the classifier’s own model confidence (i.e. softmax probabilities). For example, at percentile level 80, we take the top 20% scoring test examples based on the trust score and plot the percentage of those points that were correctly classified. We also plot the top 20% scoring test examples based on model probabilities and plot the percentage of those that were correctly classified. The vertical dotted line is the error of the logistic regression classifier. The plots are an average over 10 folds of the dataset with 50% of the data kept for the test set.
The Trust Score and Model Confidence curves then show that the model precision is typically higher when using the trust scores to rank the predictions compared to the model prediction probabilities.
[12]:
X = dataset.data
y = dataset.target
percentiles = [0 + 0.5 * i for i in range(200)]
nfolds = 10
plt_names = ['Model Confidence', 'Trust Score']
plt_title = 'Iris  Logistic Regression  Predict Correct'
[13]:
run_precision_plt(X, y, nfolds, percentiles, run_lr, plt_title=plt_title,
plt_names=plt_names, predict_correct=True)
Detect incorrectly classified examples¶
By taking the negative of the prediction probabilities and trust scores, we can also see on the plot below how the trust scores compare to the model predictions for incorrectly classified instances. The vertical dotted line is the accuracy of the logistic regression classifier. The plot shows the precision of identifying incorrectly classified instances. Higher is obviously better.
[14]:
percentiles = [50 + 0.5 * i for i in range(100)]
plt_title = 'Iris  Logistic Regression  Predict Incorrect'
run_precision_plt(X, y, nfolds, percentiles, run_lr, plt_title=plt_title,
plt_names=plt_names, predict_correct=False)
This page was generated from examples/trustscore_mnist.ipynb.
Trust Scores applied to MNIST¶
It is important to know when a machine learning classifier’s predictions can be trusted. Relying on the classifier’s (uncalibrated) prediction probabilities is not optimal and can be improved upon. Trust scores measure the agreement between the classifier and a modified nearest neighbor classifier on the test set. The trust score is the ratio between the distance of the test instance to the nearest class different from the predicted class and the distance to the predicted class. Higher scores correspond to more trustworthy predictions. A score of 1 would mean that the distance to the predicted class is the same as to another class.
The original paper on which the algorithm is based is called To Trust Or Not To Trust A Classifier. Our implementation borrows heavily from https://github.com/google/TrustScore, as does the example notebook.
Trust scores work best for low to medium dimensional feature spaces. This notebook illustrates how you can apply trust scores to high dimensional data like images by adding an additional preprocessing step in the form of an autoencoder to reduce the dimensionality. Other dimension reduction techniques like PCA can be used as well.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Input, UpSampling2D
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from alibi.confidence import TrustScore
[2]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
plt.gray()
plt.imshow(x_test[0]);
x_train shape: (60000, 28, 28) y_train shape: (60000,)
Prepare data: scale, reshape and categorize
[3]:
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = np.reshape(x_train, x_train.shape + (1,))
x_test = np.reshape(x_test, x_test.shape + (1,))
print('x_train shape:', x_train.shape, 'x_test shape:', x_test.shape)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print('y_train shape:', y_train.shape, 'y_test shape:', y_test.shape)
x_train shape: (60000, 28, 28, 1) x_test shape: (10000, 28, 28, 1)
y_train shape: (60000, 10) y_test shape: (10000, 10)
[4]:
xmin, xmax = .5, .5
x_train = ((x_train  x_train.min()) / (x_train.max()  x_train.min())) * (xmax  xmin) + xmin
x_test = ((x_test  x_test.min()) / (x_test.max()  x_test.min())) * (xmax  xmin) + xmin
Define and train model¶
For this example we are not interested in optimizing model performance so a simple softmax classifier will do:
[5]:
def sc_model():
x_in = Input(shape=(28, 28, 1))
x = Flatten()(x_in)
x_out = Dense(10, activation='softmax')(x)
sc = Model(inputs=x_in, outputs=x_out)
sc.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
return sc
[6]:
sc = sc_model()
sc.summary()
sc.fit(x_train, y_train, batch_size=128, epochs=5);
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
flatten (Flatten) (None, 784) 0
_________________________________________________________________
dense (Dense) (None, 10) 7850
=================================================================
Total params: 7,850
Trainable params: 7,850
Nontrainable params: 0
_________________________________________________________________
Epoch 1/5
60000/60000 [==============================]  1s 12us/sample  loss: 1.2706  acc: 0.6963
Epoch 2/5
60000/60000 [==============================]  1s 9us/sample  loss: 0.7030  acc: 0.8422
Epoch 3/5
60000/60000 [==============================]  1s 9us/sample  loss: 0.5762  acc: 0.8618
Epoch 4/5
60000/60000 [==============================]  1s 9us/sample  loss: 0.5155  acc: 0.8706
Epoch 5/5
60000/60000 [==============================]  1s 9us/sample  loss: 0.4787  acc: 0.8759
Evaluate the model on the test set:
[7]:
score = sc.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.8862
Define and train autoencoder¶
[8]:
def ae_model():
# encoder
x_in = Input(shape=(28, 28, 1))
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x_in)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(4, (3, 3), activation=None, padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)
encoder = Model(x_in, encoded)
# decoder
dec_in = Input(shape=(4, 4, 4))
x = Conv2D(4, (3, 3), activation='relu', padding='same')(dec_in)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation=None, padding='same')(x)
decoder = Model(dec_in, decoded)
# autoencoder = encoder + decoder
x_out = decoder(encoder(x_in))
autoencoder = Model(x_in, x_out)
autoencoder.compile(optimizer='adam', loss='mse')
return autoencoder, encoder, decoder
[9]:
ae, enc, dec = ae_model()
ae.summary()
ae.fit(x_train, x_train, batch_size=128, epochs=8, validation_data=(x_test, x_test))
ae.save('mnist_ae.h5')
enc.save('mnist_enc.h5')
Model: "model_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
model_1 (Model) (None, 4, 4, 4) 1612
_________________________________________________________________
model_2 (Model) (None, 28, 28, 1) 1757
=================================================================
Total params: 3,369
Trainable params: 3,369
Nontrainable params: 0
_________________________________________________________________
Train on 60000 samples, validate on 10000 samples
Epoch 1/8
60000/60000 [==============================]  29s 477us/sample  loss: 0.0606  val_loss: 0.0399
Epoch 2/8
60000/60000 [==============================]  34s 572us/sample  loss: 0.0341  val_loss: 0.0301
Epoch 3/8
60000/60000 [==============================]  43s 715us/sample  loss: 0.0288  val_loss: 0.0272
Epoch 4/8
60000/60000 [==============================]  48s 806us/sample  loss: 0.0265  val_loss: 0.0253
Epoch 5/8
60000/60000 [==============================]  41s 680us/sample  loss: 0.0249  val_loss: 0.0239
Epoch 6/8
60000/60000 [==============================]  39s 649us/sample  loss: 0.0237  val_loss: 0.0230
Epoch 7/8
60000/60000 [==============================]  33s 545us/sample  loss: 0.0229  val_loss: 0.0222
Epoch 8/8
60000/60000 [==============================]  29s 484us/sample  loss: 0.0224  val_loss: 0.0217
[10]:
ae = load_model('mnist_ae.h5')
enc = load_model('mnist_enc.h5')
Calculate Trust Scores¶
Initialize trust scores:
[11]:
ts = TrustScore()
The key is to fit and calculate the trust scores on the encoded instances. The encoded data still needs to be reshaped from (60000, 4, 4, 4) to (60000, 64) to comply with the kd tree format. This is handled internally:
[12]:
x_train_enc = enc.predict(x_train)
ts.fit(x_train_enc, y_train, classes=10) # 10 classes present in MNIST
Reshaping data from (60000, 4, 4, 4) to (60000, 64) so kd trees can be built.
We can now calculate the trust scores and closest not predicted classes of the predictions on the test set, using the distance to the 5th nearest neighbor in each class:
[13]:
n_samples = 1000 # calculate the trust scores for the first 1000 predictions on the test set
x_test_enc = enc.predict(x_test[:n_samples])
y_pred = sc.predict(x_test[:n_samples])
score, closest_class = ts.score(x_test_enc[:n_samples], y_pred, k=5)
Reshaping data from (1000, 4, 4, 4) to (1000, 64) so kd trees can be queried.
Let’s inspect which predictions have low and high trust scores:
[14]:
n = 5
# lowest and highest trust scores
idx_min, idx_max = np.argsort(score)[:n], np.argsort(score)[n:]
score_min, score_max = score[idx_min], score[idx_max]
closest_min, closest_max = closest_class[idx_min], closest_class[idx_max]
pred_min, pred_max = y_pred[idx_min], y_pred[idx_max]
imgs_min, imgs_max = x_test[idx_min], x_test[idx_max]
label_min, label_max = np.argmax(y_test[idx_min], axis=1), np.argmax(y_test[idx_max], axis=1)
# model confidence percentiles
max_proba = y_pred.max(axis=1)
# low score high confidence examples
idx_low = np.where((max_proba>0.80) & (max_proba<0.9) & (score<1))[0][:n]
score_low = score[idx_low]
closest_low = closest_class[idx_low]
pred_low = y_pred[idx_low]
imgs_low = x_test[idx_low]
label_low = np.argmax(y_test[idx_low], axis=1)
Low Trust Scores¶
The image below makes clear that the low trust scores correspond to misclassified images. Because the trust scores are significantly below 1, they correctly identified that the images belong to another class than the predicted class, and identified that class.
[15]:
plt.figure(figsize=(20, 4))
for i in range(n):
ax = plt.subplot(1, n, i+1)
plt.imshow(imgs_min[i].reshape(28, 28))
plt.title('Model prediction: {} (p={:.2f}) \n Label: {} \n Trust score: {:.3f}' \
'\n Closest other class: {}'.format(pred_min[i].argmax(),pred_min[i].max(),
label_min[i], score_min[i], closest_min[i]), fontsize=14)
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()
High Trust Scores¶
The high trust scores on the other hand all are very clear 1’s:
[17]:
plt.figure(figsize=(20, 4))
for i in range(n):
ax = plt.subplot(1, n, i+1)
plt.imshow(imgs_max[i].reshape(28, 28))
plt.title('Model prediction: {} (p={:.2f}) \n Label: {} \n Trust score: {:.3f}' \
'\n Closest other class: {}'.format(pred_max[i].argmax(),pred_max[i].max(),
label_max[i], score_max[i], closest_max[i]), fontsize=14)
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()
High model confidence, low trust score¶
Where trust scores really matter is when the predicted model confidence is relatively high (e.g. \(p\in[0.8, 0.9]\)) but the corresponding trust score is low, this can indicate samples for which the model is overconfident.The trust score provides a diagnostic for finding these examples:
[18]:
plt.figure(figsize=(20, 4))
for i in range(n):
ax = plt.subplot(1, n, i+1)
plt.imshow(imgs_low[i].reshape(28, 28))
plt.title('Model prediction: {} (p={:.2f}) \n Label: {} \n Trust score: {:.3f}' \
'\n Closest other class: {}'.format(pred_low[i].argmax(),pred_low[i].max(),
label_low[i], score_low[i], closest_low[i]), fontsize=14)
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()
We can see several examples of an overconfident model predicting the wrong class, the low trust score, however, reveals that this is happening and the predictions should not be trusted despite the high model confidence.
In the following section we will see that on average trust scores outperform the model confidence for identifying correctly classified samples.
Comparison of Trust Scores with model prediction probabilities¶
Let’s compare the prediction probabilities from the classifier with the trust scores for each prediction by checking whether trust scores are better than the model’s prediction probabilities at identifying correctly classified examples.
First we need to set up a couple of helper functions.
Define a function that handles model training and predictions:
[19]:
def run_sc(X_train, y_train, X_test):
clf = sc_model()
clf.fit(X_train, y_train, batch_size=128, epochs=5, verbose=0)
y_pred_proba = clf.predict(X_test)
y_pred = np.argmax(y_pred_proba, axis=1)
probas = y_pred_proba[range(len(y_pred)), y_pred] # probabilities of predicted class
return y_pred, probas
Define the function that generates the precision plots:
[20]:
def plot_precision_curve(plot_title,
percentiles,
labels,
final_tp,
final_stderr,
final_misclassification,
colors = ['blue', 'darkorange', 'brown', 'red', 'purple']):
plt.title(plot_title, fontsize=18)
colors = colors + list(cm.rainbow(np.linspace(0, 1, len(final_tp))))
plt.xlabel("Percentile", fontsize=14)
plt.ylabel("Precision", fontsize=14)
for i, label in enumerate(labels):
ls = "" if ("Model" in label) else ""
plt.plot(percentiles, final_tp[i], ls, c=colors[i], label=label)
plt.fill_between(percentiles,
final_tp[i]  final_stderr[i],
final_tp[i] + final_stderr[i],
color=colors[i],
alpha=.1)
if 0. in percentiles:
plt.legend(loc="lower right", fontsize=14)
else:
plt.legend(loc="upper left", fontsize=14)
model_acc = 100 * (1  final_misclassification)
plt.axvline(x=model_acc, linestyle="dotted", color="black")
plt.show()
The function below trains the model on a number of folds, makes predictions, calculates the trust scores, and generates the precision curves to compare the trust scores with the model prediction probabilities:
[21]:
def run_precision_plt(X, y, nfolds, percentiles, run_model, test_size=.2,
plt_title="", plt_names=[], predict_correct=True, classes=10):
def stderr(L):
return np.std(L) / np.sqrt(len(L))
all_tp = [[[] for p in percentiles] for _ in plt_names]
misclassifications = []
mult = 1 if predict_correct else 1
folds = StratifiedShuffleSplit(n_splits=nfolds, test_size=test_size, random_state=0)
for train_idx, test_idx in folds.split(X, y):
# create train and test folds, train model and make predictions
X_train, y_train = X[train_idx, :], y[train_idx, :]
X_test, y_test = X[test_idx, :], y[test_idx, :]
y_pred, probas = run_sc(X_train, y_train, X_test)
# target points are the correctly classified points
y_test_class = np.argmax(y_test, axis=1)
target_points = (np.where(y_pred == y_test_class)[0] if predict_correct else
np.where(y_pred != y_test_class)[0])
final_curves = [probas]
# calculate trust scores
ts = TrustScore()
ts.fit(enc.predict(X_train), y_train, classes=classes)
scores, _ = ts.score(enc.predict(X_test), y_pred, k=5)
final_curves.append(scores) # contains prediction probabilities and trust scores
# check where prediction probabilities and trust scores are above a certain percentage level
for p, perc in enumerate(percentiles):
high_proba = [np.where(mult * curve >= np.percentile(mult * curve, perc))[0] for curve in final_curves]
if 0 in map(len, high_proba):
continue
# calculate fraction of values above percentage level that are correctly (or incorrectly) classified
tp = [len(np.intersect1d(hp, target_points)) / (1. * len(hp)) for hp in high_proba]
for i in range(len(plt_names)):
all_tp[i][p].append(tp[i]) # for each percentile, store fraction of values above cutoff value
misclassifications.append(len(target_points) / (1. * len(X_test)))
# average over folds for each percentile
final_tp = [[] for _ in plt_names]
final_stderr = [[] for _ in plt_names]
for p, perc in enumerate(percentiles):
for i in range(len(plt_names)):
final_tp[i].append(np.mean(all_tp[i][p]))
final_stderr[i].append(stderr(all_tp[i][p]))
for i in range(len(all_tp)):
final_tp[i] = np.array(final_tp[i])
final_stderr[i] = np.array(final_stderr[i])
final_misclassification = np.mean(misclassifications)
# create plot
plot_precision_curve(plt_title, percentiles, plt_names, final_tp, final_stderr, final_misclassification)
Detect correctly classified examples¶
The xaxis on the plot below shows the percentiles for the model prediction probabilities of the predicted class for each instance and for the trust scores. The yaxis represents the precision for each percentile. For each percentile level, we take the test examples whose trust score is above that percentile level and plot the percentage of those points that were correctly classified by the classifier. We do the same with the classifier’s own model confidence (i.e. softmax probabilities). For example, at percentile level 80, we take the top 20% scoring test examples based on the trust score and plot the percentage of those points that were correctly classified. We also plot the top 20% scoring test examples based on model probabilities and plot the percentage of those that were correctly classified. The vertical dotted line is the error of the classifier. The plots are an average over 2 folds of the dataset with 20% of the data kept for the test set.
The Trust Score and Model Confidence curves then show that the model precision is typically higher when using the trust scores to rank the predictions compared to the model prediction probabilities.
[22]:
X = x_train
y = y_train
percentiles = [0 + 0.5 * i for i in range(200)]
nfolds = 2
plt_names = ['Model Confidence', 'Trust Score']
plt_title = 'MNIST  Softmax Classifier  Predict Correct'
[23]:
run_precision_plt(X, y, nfolds, percentiles, run_sc, plt_title=plt_title,
plt_names=plt_names, predict_correct=True)
Reshaping data from (48000, 4, 4, 4) to (48000, 64) so kd trees can be built.
Reshaping data from (12000, 4, 4, 4) to (12000, 64) so kd trees can be queried.
Reshaping data from (48000, 4, 4, 4) to (48000, 64) so kd trees can be built.
Reshaping data from (12000, 4, 4, 4) to (12000, 64) so kd trees can be queried.
alibi¶
alibi package¶
Subpackages¶
alibi.confidence package¶
The ‘alibi.confidence’ module includes trust scores.

alibi.confidence.
linearity_measure
(predict_fn, x, feature_range=None, method='grid', X_train=None, epsilon=0.04, nb_samples=10, res=100, alphas=None, agg='global', model_type='classifier')[source]¶ Calculate the linearity measure of the model around an instance of interest x.
 Parameters
predict_fn (
Callable
) – Predict function.x (numpy.ndarray) – Instance of interest.
feature_range (
Union
[List
, numpy.ndarray,None
]) – Array with min and max values for each feature.method (
str
) – Method for sampling. Supported values ‘knn’ or ‘grid’.X_train (
Optional
[numpy.ndarray]) – Training set.epsilon (
float
) – Size of the sampling region as a percentage of the feature range.nb_samples (
int
) – Number of samples to generate.res (
int
) – Resolution of the grid. Number of intervals in which the features range is discretized.alphas (
Optional
[numpy.ndarray]) – Coefficients in the superposition.agg (
str
) – Aggregation method. Supported values ‘global’ or ‘pairwise’.model_type (
str
) – Type of task. Supported values ‘regressor’ or ‘classifier’.
 Return type
numpy.ndarray
 Returns
Linearity measure

class
alibi.confidence.
LinearityMeasure
(method='grid', epsilon=0.04, nb_samples=10, res=100, alphas=None, model_type='classifier', agg='pairwise', verbose=False)[source]¶ Bases:
object

__init__
(method='grid', epsilon=0.04, nb_samples=10, res=100, alphas=None, model_type='classifier', agg='pairwise', verbose=False)[source]¶  Parameters
method (
str
) – Method for sampling. Supported methods are ‘knn’ or ‘grid’.epsilon (
float
) – Size of the sampling region around the central instance as a percentage of the features range.nb_samples (
int
) – Number of samples to generate.res (
int
) – Resolution of the grid. Number of intervals in which the feature range is discretized.alphas (
Optional
[numpy.ndarray]) – Coefficients in the superposition.agg (
str
) – Aggregation method. Supported values are ‘global’ or ‘pairwise’.model_type (
str
) – Type of task. Supported values are ‘regressor’ or ‘classifier’.
 Return type
None


class
alibi.confidence.
TrustScore
(k_filter=10, alpha=0.0, filter_type=None, leaf_size=40, metric='euclidean', dist_filter_type='point')[source]¶ Bases:
object

__init__
(k_filter=10, alpha=0.0, filter_type=None, leaf_size=40, metric='euclidean', dist_filter_type='point')[source]¶ Initialize trust scores.
 Parameters
k_filter (
int
) – Number of neighbors used during either kNN distance or probability filtering.alpha (
float
) – Fraction of instances to filter out to reduce impact of outliers.filter_type (
Optional
[str
]) – Filter method; either ‘distance_knn’ or ‘probability_knn’leaf_size (
int
) – Number of points at which to switch to bruteforce. Affects speed and memory required to build trees. Memory to store the tree scales with n_samples / leaf_size.metric (
str
) – Distance metric used for the tree. See sklearn’s DistanceMetric class for a list of available metrics.dist_filter_type (
str
) – Use either the distance to the knearest point (dist_filter_type = ‘point’) or the average distance from the first to the knearest point in the data (dist_filter_type = ‘mean’).
 Return type
None

filter_by_distance_knn
(X)[source]¶ Filter out instances with low kNN density. Calculate distance to knearest point in the data for each instance and remove instances above a cutoff distance.
 Parameters
X (numpy.ndarray) – Data
 Return type
numpy.ndarray
 Returns
Filtered data.

filter_by_probability_knn
(X, Y)[source]¶ Filter out instances with high label disagreement amongst its k nearest neighbors.
 Parameters
X (numpy.ndarray) – Data
Y (numpy.ndarray) – Predicted class labels
 Return type
Tuple
[numpy.ndarray, numpy.ndarray] Returns
Filtered data and labels.

score
(X, Y, k=2, dist_type='point')[source]¶ Calculate trust scores = ratio of distance to closest class other than the predicted class to distance to predicted class.
 Parameters
X (numpy.ndarray) – Instances to calculate trust score for.
Y (numpy.ndarray) – Either prediction probabilities for each class or the predicted class.
k (
int
) – Number of nearest neighbors used for distance calculation.dist_type (
str
) – Use either the distance to the knearest point (dist_type = ‘point’) or the average distance from the first to the knearest point in the data (dist_type = ‘mean’).
 Return type
Tuple
[numpy.ndarray, numpy.ndarray] Returns
Batch with trust scores and the closest not predicted class.

Submodules¶

class
alibi.confidence.model_linearity.
LinearityMeasure
(method='grid', epsilon=0.04, nb_samples=10, res=100, alphas=None, model_type='classifier', agg='pairwise', verbose=False)[source]¶ Bases:
object

__init__
(method='grid', epsilon=0.04, nb_samples=10, res=100, alphas=None, model_type='classifier', agg='pairwise', verbose=False)[source]¶  Parameters
method (
str
) – Method for sampling. Supported methods are ‘knn’ or ‘grid’.epsilon (
float
) – Size of the sampling region around the central instance as a percentage of the features range.nb_samples (
int
) – Number of samples to generate.res (
int
) – Resolution of the grid. Number of intervals in which the feature range is discretized.alphas (
Optional
[numpy.ndarray]) – Coefficients in the superposition.agg (
str
) – Aggregation method. Supported values are ‘global’ or ‘pairwise’.model_type (
str
) – Type of task. Supported values are ‘regressor’ or ‘classifier’.
 Return type
None


alibi.confidence.model_linearity.
linearity_measure
(predict_fn, x, feature_range=None, method='grid', X_train=None, epsilon=0.04, nb_samples=10, res=100, alphas=None, agg='global', model_type='classifier')[source]¶ Calculate the linearity measure of the model around an instance of interest x.
 Parameters
predict_fn (
Callable
) – Predict function.x (numpy.ndarray) – Instance of interest.
feature_range (
Union
[List
, numpy.ndarray,None
]) – Array with min and max values for each feature.method (
str
) – Method for sampling. Supported values ‘knn’ or ‘grid’.X_train (
Optional
[numpy.ndarray]) – Training set.epsilon (
float
) – Size of the sampling region as a percentage of the feature range.nb_samples (
int
) – Number of samples to generate.res (
int
) – Resolution of the grid. Number of intervals in which the features range is discretized.alphas (
Optional
[numpy.ndarray]) – Coefficients in the superposition.agg (
str
) – Aggregation method. Supported values ‘global’ or ‘pairwise’.model_type (
str
) – Type of task. Supported values ‘regressor’ or ‘classifier’.
 Return type
numpy.ndarray
 Returns
Linearity measure

class
alibi.confidence.trustscore.
TrustScore
(k_filter=10, alpha=0.0, filter_type=None, leaf_size=40, metric='euclidean', dist_filter_type='point')[source]¶ Bases:
object

__init__
(k_filter=10, alpha=0.0, filter_type=None, leaf_size=40, metric='euclidean', dist_filter_type='point')[source]¶ Initialize trust scores.
 Parameters
k_filter (
int
) – Number of neighbors used during either kNN distance or probability filtering.alpha (
float
) – Fraction of instances to filter out to reduce impact of outliers.filter_type (
Optional
[str
]) – Filter method; either ‘distance_knn’ or ‘probability_knn’leaf_size (
int
) – Number of points at which to switch to bruteforce. Affects speed and memory required to build trees. Memory to store the tree scales with n_samples / leaf_size.metric (
str
) – Distance metric used for the tree. See sklearn’s DistanceMetric class for a list of available metrics.dist_filter_type (
str
) – Use either the distance to the knearest point (dist_filter_type = ‘point’) or the average distance from the first to the knearest point in the data (dist_filter_type = ‘mean’).
 Return type
None

filter_by_distance_knn
(X)[source]¶ Filter out instances with low kNN density. Calculate distance to knearest point in the data for each instance and remove instances above a cutoff distance.
 Parameters
X (numpy.ndarray) – Data
 Return type
numpy.ndarray
 Returns
Filtered data.

filter_by_probability_knn
(X, Y)[source]¶ Filter out instances with high label disagreement amongst its k nearest neighbors.
 Parameters
X (numpy.ndarray) – Data
Y (numpy.ndarray) – Predicted class labels
 Return type
Tuple
[numpy.ndarray, numpy.ndarray] Returns
Filtered data and labels.

score
(X, Y, k=2, dist_type='point')[source]¶ Calculate trust scores = ratio of distance to closest class other than the predicted class to distance to predicted class.
 Parameters
X (numpy.ndarray) – Instances to calculate trust score for.
Y (numpy.ndarray) – Either prediction probabilities for each class or the predicted class.
k (
int
) – Number of nearest neighbors used for distance calculation.dist_type (
str
) – Use either the distance to the knearest point (dist_type = ‘point’) or the average distance from the first to the knearest point in the data (dist_type = ‘mean’).
 Return type
Tuple
[numpy.ndarray, numpy.ndarray] Returns
Batch with trust scores and the closest not predicted class.

alibi.explainers package¶
The ‘alibi.explainers’ module includes feature importance, counterfactual and anchorbased explainers.

class
alibi.explainers.
AnchorTabular
(predict_fn, feature_names, categorical_names={})[source]¶ Bases:
object

__init__
(predict_fn, feature_names, categorical_names={})[source]¶ Initialize the anchor tabular explainer.

explain
(X, threshold=0.95, delta=0.1, tau=0.15, batch_size=100, max_anchor_size=None, desired_label=None, **kwargs)[source]¶ Explain instance and return anchor with metadata.
 Parameters
X (numpy.ndarray) – Instance to be explained
threshold (
float
) – Minimum precision thresholddelta (
float
) – Used to compute betatau (
float
) – Margin between lower confidence bound and minimum precision or upper boundbatch_size (
int
) – Batch size used for samplingmax_anchor_size (
Optional
[int
]) – Maximum number of features in anchordesired_label (
Optional
[int
]) – Label to use as true label for the instance to be explained
 Return type
 Returns
explanation – Dictionary containing the anchor explaining the instance with additional metadata

fit
(train_data, disc_perc=[25, 50, 75])[source]¶ Fit discretizer to train data to bin ordinal features and compute statistics for ordinal features.
 Parameters
train_data (numpy.ndarray) – Representative sample from the training data
disc_perc (
list
) – List with percentiles (int) used for discretization
 Return type
None

get_sample_fn
(X, desired_label=None)[source]¶ Create sampling function and mapping dictionary between categorized data and the feature types and values.
 Parameters
 Return type
 Returns
sample_fn – Function returning raw and categorized sampled data, and labels
mapping –
 Dict: key = feature column or bin for ordinal features in categorized data; value = tuple containing
(feature column, flag for categorical/ordinal feature, feature value or bin value)

sample_from_train
(conditions_eq, conditions_neq, conditions_geq, conditions_leq, num_samples)[source]¶ Sample data from training set but keep features which are present in the proposed anchor the same as the feature value or bin (for ordinal features) as the instance to be explained.
 Parameters
conditions_eq (
dict
) – Dict: key = feature column; value = categorical feature valueconditions_neq (
dict
) – Not used at the momentconditions_geq (
dict
) – Dict: key = feature column; value = bin value of ordinal feature where bin value < feature valueconditions_leq (
dict
) – Dict: key = feature column; value = bin value of ordinal feature where bin value >= feature valuenum_samples (
int
) – Number of samples used when sampling from training set
 Return type
numpy.ndarray
 Returns
sample – Sampled data from training set


class
alibi.explainers.
AnchorText
(nlp, predict_fn)[source]¶ Bases:
object

explain
(text, threshold=0.95, delta=0.1, tau=0.15, batch_size=100, top_n=100, desired_label=None, use_similarity_proba=False, use_unk=True, sample_proba=0.5, temperature=1.0, **kwargs)[source]¶ Explain instance and return anchor with metadata.
 Parameters
text (
str
) – Text instance to be explainedthreshold (
float
) – Minimum precision thresholddelta (
float
) – Used to compute betatau (
float
) – Margin between lower confidence bound and minimum precision or upper boundbatch_size (
int
) – Batch size used for samplingtop_n (
int
) – Number of similar words to sample for perturbations, only used if use_proba=Truedesired_label (
Optional
[int
]) – Label to use as true label for the instance to be explaineduse_similarity_proba (
bool
) – Bool whether to sample according to a similarity score with the corpus embeddings. use_unk needs to be False in order for use_similarity_proba equals True to be used.use_unk (
bool
) – If True, perturbation distribution will replace words randomly with UNKs. If False, words will be replaced by similar words using word embeddings.sample_proba (
float
) – Sample probability if use_similarity_proba is Falsetemperature (
float
) – Sample weight hyperparameter if use_similarity_proba equals Truekwargs (
Any
) – Other keyword arguments passed to the anchor beam search and the text sampling and perturbation functions
 Return type
 Returns
explanation – Dictionary containing the anchor explaining the instance with additional metadata

get_sample_fn
(text, desired_label=None, use_similarity_proba=False, use_unk=True, sample_proba=0.5, top_n=100, temperature=0.4, **kwargs)[source]¶ Create sampling function as well as lists with the words and word positions in the text.
 Parameters
text (
str
) – Text instance to be explaineddesired_label (
Optional
[int
]) – Label to use as true label for the instance to be explaineduse_similarity_proba (
bool
) – Bool whether to sample according to a similarity score with the corpus embeddingsuse_unk (
bool
) – If True, perturbation distribution will replace words randomly with UNKs. If False, words will be replaced by similar words using word embeddings.sample_proba (
float
) – Sample probability if use_similarity_proba is Falsetop_n (
int
) – Sample using only top_n instances from similar words in the corpustemperature (
float
) – Sample weight hyperparameter if use_similarity_proba equals True
 Return type
 Returns
words – List with words in the text
positions – List with positions of the words in the text
sample_fn – Function returning perturbed text instances, matrix with flags for perturbed words and labels

perturb_sentence
(text, present, n, sample_proba=0.5, top_n=100, forbidden={}, forbidden_tags={'PRP$'}, forbidden_words={'be'}, pos={'ADJ', 'ADP', 'ADV', 'DET', 'NOUN', 'VERB'}, use_similarity_proba=True, temperature=1.0, **kwargs)[source]¶ Perturb the text instance to be explained.
 Parameters
text (
str
) – Text instance to be explainedpresent (
list
) – List with the word index in the text for the words in the proposed anchorn (
int
) – Number of samples used when sampling from the corpussample_proba (
float
) – Sample probability for a word if use_similarity_proba is Falsetop_n (
int
) – Keep only top_n instances from similar words in the corpusforbidden (
set
) – Forbidden lemmasforbidden_tags (
set
) – Forbidden POS tagsforbidden_words (
set
) – Forbidden wordspos (
set
) – POS that can be changed during perturbationuse_similarity_proba (
bool
) – Bool whether to sample according to a similarity score with the corpus embeddingstemperature (
float
) – Sample weight hyperparameter if use_similarity_proba equals True
 Return type
 Returns
raw_data – List with num_samples of perturbed text instance
data – Matrix with 1s and 0s indicating whether a word in the text has not been perturbed for each sample


class
alibi.explainers.
AnchorImage
(predict_fn, image_shape, segmentation_fn='slic', segmentation_kwargs=None, images_background=None)[source]¶ Bases:
object

__init__
(predict_fn, image_shape, segmentation_fn='slic', segmentation_kwargs=None, images_background=None)[source]¶ Initialize anchor image explainer.
 Parameters
predict_fn (
Callable
) – Model prediction function.image_shape (
tuple
) – Shape of the image to be explained.segmentation_fn (
Any
) – Any of the built in segmentation function strings: ‘felzenszwalb’, ‘slic’ or ‘quickshift’ or a custom segmentation function (callable) which returns an image mask with labels for each superpixel. See http://scikitimage.org/docs/dev/api/skimage.segmentation.html for more info.segmentation_kwargs (
Optional
[dict
]) – Keyword arguments for the built in segmentation functions.images_background (
Optional
[numpy.ndarray]) – Images to overlay superpixels on.
 Return type
None

explain
(image, threshold=0.95, delta=0.1, tau=0.15, batch_size=100, p_sample=0.5, **kwargs)[source]¶ Explain instance and return anchor with metadata.
 Parameters
image (numpy.ndarray) – Image to be explained
threshold (
float
) – Minimum precision thresholddelta (
float
) – Used to compute betatau (
float
) – Margin between lower confidence bound and minimum precision of upper boundbatch_size (
int
) – Batch size used for samplingp_sample (
float
) – Probability for a pixel to be represented by the average value of its superpixel
 Returns
explanation – Dictionary containing the anchor explaining the instance with additional metadata

get_sample_fn
(image, p_sample=0.5)[source]¶ Create sampling function and superpixel mask.
 Parameters
image (numpy.ndarray) – Image to be explained
p_sample (
float
) – Probability for a pixel to be represented by the average value of its superpixel or the pixel value of a superimposed image
 Return type
 Returns
segments – Superpixels generated from image
sample_fn – Function returning the sampled images with label


class
alibi.explainers.
CEM
(predict, mode, shape, kappa=0.0, beta=0.1, feature_range=(10000000000.0, 10000000000.0), gamma=0.0, ae_model=None, learning_rate_init=0.01, max_iterations=1000, c_init=10.0, c_steps=10, eps=(0.001, 0.001), clip=(100.0, 100.0), update_num_grad=1, no_info_val=None, write_dir=None, sess=None)[source]¶ Bases:
object

__init__
(predict, mode, shape, kappa=0.0, beta=0.1, feature_range=(10000000000.0, 10000000000.0), gamma=0.0, ae_model=None, learning_rate_init=0.01, max_iterations=1000, c_init=10.0, c_steps=10, eps=(0.001, 0.001), clip=(100.0, 100.0), update_num_grad=1, no_info_val=None, write_dir=None, sess=None)[source]¶ Initialize contrastive explanation method. Paper: https://arxiv.org/abs/1802.07623
 Parameters
predict (
Union
[Callable
, tensorflow.keras.Model,Model')
]) – Keras or TensorFlow model or any other model’s prediction function returning class probabilitiesmode (
str
) – Find pertinant negatives (‘PN’) or pertinant positives (‘PP’)shape (
tuple
) – Shape of input data starting with batch sizekappa (
float
) – Confidence parameter for the attack loss termbeta (
float
) – Regularization constant for L1 loss termfeature_range (
tuple
) – Tuple with min and max ranges to allow for perturbed instances. Min and max ranges can be floats or numpy arrays with dimension (1x nb of features) for featurewise rangesgamma (
float
) – Regularization constant for optional autoencoder loss termae_model (
Union
[tensorflow.keras.Model,Model')
]) – Optional autoencoder model used for loss regularizationlearning_rate_init (
float
) – Initial learning rate of optimizermax_iterations (
int
) – Maximum number of iterations for finding a PN or PPc_init (
float
) – Initial value to scale the attack loss termc_steps (
int
) – Number of iterations to adjust the constant scaling the attack loss termeps (
tuple
) – If numerical gradients are used to compute dL/dx = (dL/dp) * (dp/dx), then eps[0] is used to calculate dL/dp and eps[1] is used for dp/dx. eps[0] and eps[1] can be a combination of float values and numpy arrays. For eps[0], the array dimension should be (1x nb of prediction categories) and for eps[1] it should be (1x nb of features)clip (
tuple
) – Tuple with min and max clip ranges for both the numerical gradients and the gradients obtained from the TensorFlow graphupdate_num_grad (
int
) – If numerical gradients are used, they will be updated every update_num_grad iterationsno_info_val (
Union
[float
, numpy.ndarray]) – Global or featurewise value considered as containing no informationwrite_dir (
str
) – Directory to write tensorboard files tosess (tensorflow.compat.v1.Session) – Optional Tensorflow session that will be used if passed instead of creating or inferring one internally
 Return type

attack
(X, Y, verbose=False)[source]¶ Find pertinent negative or pertinent positive for instance X using a fast iterative shrinkagethresholding algorithm (FISTA).

fit
(train_data, no_info_type='median')[source]¶ Get ‘no information’ values from the training data.
 Parameters
train_data (numpy.ndarray) – Representative sample from the training data
no_info_type (
str
) – Median or mean value by feature supported
 Return type
None

get_gradients
(X, Y)[source]¶ Compute numerical gradients of the attack loss term: dL/dx = (dL/dP)*(dP/dx) with L = loss_attack_s; P = predict; x = adv_s
 Parameters
X (numpy.ndarray) – Instance around which gradient is evaluated
Y (numpy.ndarray) – Onehot representation of instance labels
 Return type
numpy.ndarray
 Returns
Array with gradients.

loss_fn
(pred_proba, Y)[source]¶ Compute the attack loss.
 Parameters
pred_proba (numpy.ndarray) – Prediction probabilities of an instance
Y (numpy.ndarray) – Onehot representation of instance labels
 Return type
numpy.ndarray
 Returns
Loss of the attack.


class
alibi.explainers.
CounterFactual
(predict_fn, shape, distance_fn='l1', target_proba=1.0, target_class='other', max_iter=1000, early_stop=50, lam_init=0.1, max_lam_steps=10, tol=0.05, learning_rate_init=0.1, feature_range=(10000000000.0, 10000000000.0), eps=0.01, init='identity', decay=True, write_dir=None, debug=False, sess=None)[source]¶ Bases:
object

__init__
(predict_fn, shape, distance_fn='l1', target_proba=1.0, target_class='other', max_iter=1000, early_stop=50, lam_init=0.1, max_lam_steps=10, tol=0.05, learning_rate_init=0.1, feature_range=(10000000000.0, 10000000000.0), eps=0.01, init='identity', decay=True, write_dir=None, debug=False, sess=None)[source]¶ Initialize counterfactual explanation method based on Wachter et al. (2017)
 Parameters
predict_fn (
Union
[Callable
, tensorflow.keras.Model,Model')
]) – Keras or TensorFlow model or any other model’s prediction function returning class probabilitiesshape (
Tuple
[int
, …]) – Shape of input data starting with batch sizedistance_fn (
str
) – Distance function to use in the loss termtarget_proba (
float
) – Target probability for the counterfactual to reachtarget_class (
Union
[str
,int
]) – Target class for the counterfactual to reach, one of ‘other’, ‘same’ or an integer denoting desired class membership for the counterfactual instancemax_iter (
int
) – Maximum number of interations to run the gradient descent for (inner loop)early_stop (
int
) – Number of steps after which to terminate gradient descent if all or none of found instances are solutionslam_init (
float
) – Initial regularization constant for the prediction part of the Wachter lossmax_lam_steps (
int
) – Maximum number of times to adjust the regularization constant (outer loop) before terminating the searchtol (
float
) – Tolerance for the counterfactual target probabilitylearning_rate_init – Initial learning rate for each outer loop of lambda
feature_range (
Union
[Tuple
,str
]) – Tuple with min and max ranges to allow for perturbed instances. Min and max ranges can be floats or numpy arrays with dimension (1 x nb of features) for featurewise rangeseps (
Union
[float
, numpy.ndarray]) – Gradient step sizes used in calculating numerical gradients, defaults to a single value for all features, but can be passed an array for featurewise step sizesinit (
str
) – Initialization method for the search of counterfactuals, currently must be ‘identity’decay (
bool
) – Flag to decay learning rate to zero for each outer loop over lambdawrite_dir (
str
) – Directory to write Tensorboard files todebug (
bool
) – Flag to write Tensorboard summaries for debuggingsess (tensorflow.compat.v1.Session) – Optional Tensorflow session that will be used if passed instead of creating or inferring one internally
 Return type


class
alibi.explainers.
CounterFactualProto
(predict, shape, kappa=0.0, beta=0.1, feature_range=(10000000000.0, 10000000000.0), gamma=0.0, ae_model=None, enc_model=None, theta=0.0, cat_vars=None, ohe=False, use_kdtree=False, learning_rate_init=0.01, max_iterations=1000, c_init=10.0, c_steps=10, eps=(0.001, 0.001), clip=(1000.0, 1000.0), update_num_grad=1, write_dir=None, sess=None)[source]¶ Bases:
object

__init__
(predict, shape, kappa=0.0, beta=0.1, feature_range=(10000000000.0, 10000000000.0), gamma=0.0, ae_model=None, enc_model=None, theta=0.0, cat_vars=None, ohe=False, use_kdtree=False, learning_rate_init=0.01, max_iterations=1000, c_init=10.0, c_steps=10, eps=(0.001, 0.001), clip=(1000.0, 1000.0), update_num_grad=1, write_dir=None, sess=None)[source]¶ Initialize prototypical counterfactual method.
 Parameters
predict (
Union
[Callable
, tensorflow.keras.Model,Model')
]) – Keras or TensorFlow model or any other model’s prediction function returning class probabilitiesshape (
tuple
) – Shape of input data starting with batch sizekappa (
float
) – Confidence parameter for the attack loss termbeta (
float
) – Regularization constant for L1 loss termfeature_range (
tuple
) – Tuple with min and max ranges to allow for perturbed instances. Min and max ranges can be floats or numpy arrays with dimension (1x nb of features) for featurewise rangesgamma (
float
) – Regularization constant for optional autoencoder loss termae_model (
Union
[tensorflow.keras.Model,Model')
]) – Optional autoencoder model used for loss regularizationenc_model (
Union
[tensorflow.keras.Model,Model')
]) – Optional encoder model used to guide instance perturbations towards a class prototypetheta (
float
) – Constant for the prototype search loss termcat_vars (
dict
) – Dict with as keys the categorical columns and as values the number of categories per categorical variable.ohe (
bool
) – Whether the categorical variables are onehot encoded (OHE) or not. If not OHE, they are assumed to have ordinal encodings.use_kdtree (
bool
) – Whether to use kd trees for the prototype loss term if no encoder is availablelearning_rate_init (
float
) – Initial learning rate of optimizermax_iterations (
int
) – Maximum number of iterations for finding a counterfactualc_init (
float
) – Initial value to scale the attack loss termc_steps (
int
) – Number of iterations to adjust the constant scaling the attack loss termeps (
tuple
) – If numerical gradients are used to compute dL/dx = (dL/dp) * (dp/dx), then eps[0] is used to calculate dL/dp and eps[1] is used for dp/dx. eps[0] and eps[1] can be a combination of float values and numpy arrays. For eps[0], the array dimension should be (1x nb of prediction categories) and for eps[1] it should be (1x nb of features)clip (
tuple
) – Tuple with min and max clip ranges for both the numerical gradients and the gradients obtained from the TensorFlow graphupdate_num_grad (
int
) – If numerical gradients are used, they will be updated every update_num_grad iterationswrite_dir (
str
) – Directory to write tensorboard files tosess (tensorflow.compat.v1.Session) – Optional Tensorflow session that will be used if passed instead of creating or inferring one internally
 Return type

attack
(X, Y, target_class=None, k=None, k_type='mean', threshold=0.0, verbose=False, print_every=100, log_every=100)[source]¶ Find a counterfactual (CF) for instance X using a fast iterative shrinkagethresholding algorithm (FISTA).
 Parameters
X (numpy.ndarray) – Instance to attack
Y (numpy.ndarray) – Labels for X as onehotencoding
target_class (
Optional
[list
]) – List with target classes used to find closest prototype. If None, the nearest prototype except for the predict class on the instance is used.k (
Optional
[int
]) – Number of nearest instances used to define the prototype for a class. Defaults to using all instances belonging to the class if an encoder is used and to 1 for kd trees.k_type (
str
) – Use either the average encoding of the k nearest instances in a class (k_type=’mean’) or the knearest encoding in the class (k_type=’point’) to define the prototype of that class. Only relevant if an encoder is used to define the prototypes.threshold (
float
) – Threshold level for the ratio between the distance of the counterfactual to the prototype of the predicted class for the original instance over the distance to the prototype of the predicted class for the counterfactual. If the trust score is below the threshold, the proposed counterfactual does not meet the requirements.verbose (
bool
) – Print intermediate results of optimization if Trueprint_every (
int
) – Print frequency if verbose is Truelog_every (
int
) – Tensorboard log frequency if write directory is specified
 Return type
