
Wolpert, a stacked generalization framework

Wolpert is a scikit-learn compatible framework for easily building stacked ensembles. It supports:

  • Different stacking strategies
  • Multi-layer models
  • Different weights for each transformer
  • Easy to distribute the computation

Quickstart

Install

The easiest way to install is using pip. Just run pip install wolpert and you’re ready to go.

Building a simple model

First we need the layers of our model. The simplest way is using the helper function make_stack_layer:

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from wolpert import make_stack_layer, StackingPipeline

layer0 = make_stack_layer(SVC(), KNeighborsClassifier(),
                          RandomForestClassifier(),
                          blending_wrapper='holdout')

clf = StackingPipeline([('l0', layer0),
                        ('l1', LogisticRegression())])

And that’s it! StackingPipeline inherits from scikit-learn’s Pipeline, so it works just the same:

clf.fit(Xtrain, ytrain)
ypreds = clf.predict_proba(Xtest)

This is just a basic example, but there are several ways of building a stacked ensemble with this framework. Make sure to check the User Guide to learn more.

User Guide

Stacked generalization is a method for combining estimators to reduce their biases [W1992]: several estimators are stacked together in layers, possibly combined non-linearly. Each layer contains estimators, and their predictions are used as features for the next layer.

As stacked generalization is a generic framework for combining supervised estimators, it works for both regression and classification problems, and the API is the same for both categories.
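
For instance, the building blocks from the quickstart work just as well with regressors. Here's a minimal sketch; the estimator choices are illustrative, and with the default method='auto' (visible in the verbose output later in this guide) the wrappers should fall back to predict for models without predict_proba:

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from wolpert import make_stack_layer, StackingPipeline

# A two-layer stacked regressor: the first layer blends two base
# regressors, the second fits a linear model on their predictions.
reg = StackingPipeline([('l0', make_stack_layer(Ridge(),
                                                RandomForestRegressor(),
                                                blending_wrapper='holdout')),
                        ('l1', Ridge())])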

The intent of this user guide is to serve both as an introduction to stacked generalization and as a manual for using the framework.

Introduction to stacked generalization

Note

If you’re already familiar with stacked generalization, you can skip this part and go straight to the usage chapter, where we’ll discuss how to build stacked ensembles with Wolpert.

From the original paper, stacked generalization is “…a scheme for minimizing the generalization error rate of one or more generalizers. Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess” [W1992]. Basically, what this means is training a set of estimators on a dataset, generating predictions from them and training another estimator on those predictions. Here’s an example:

[Figure: an example stacked ensemble]

If you look at the image, it resembles a neural network. The edges may even have weights, just like in a neural network (check the transformer_weights parameter on the StackingLayer class), and the stack can also be deeper. The problem with stacked generalization is that, as the models aren’t differentiable, we can’t train them with something like gradient descent. Instead, we build each layer one by one. To generate a data set for the next layer of the model, we essentially run a cross validation on the previous layer and use the predictions for the holdout sets as the new training set. Generating a new training set from an estimator’s cross validation predictions is referred to by some as blending [MLW2015]. There are several strategies for this step, so if you want to learn more, check the stacking strategies chapter.

Suppose we are building a stacked ensemble with two layers and the chosen blending method uses a 2-fold cross validation scheme. The basic algorithm is as follows:

  1. Split the training set in 2 parts;
  2. Train each estimator in the first layer using the first part of the training set and create predictions for the second part;
  3. Train each estimator in the first layer using the second part of the training set and create predictions for the first part;
  4. Use these predictions to train the estimators on the next layer;
  5. Train each of the first layer's estimators on the whole training set.

The interesting part is that the final ensemble should perform at least as well as the best estimator used in the inner layers of our model [W1992].
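
To make these steps concrete, here's a minimal sketch of the algorithm above written with plain scikit-learn. Wolpert automates all of this; the data and estimators are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
base_estimators = [SVC(probability=True, random_state=0),
                   RandomForestClassifier(random_state=0)]

# Steps 1-3: out-of-fold predictions from a 2-fold cross validation,
# for every estimator in the first layer.
blended = np.hstack([cross_val_predict(est, X, y, cv=2,
                                       method='predict_proba')
                     for est in base_estimators])

# Step 4: train the next layer on the blended predictions.
meta = LogisticRegression().fit(blended, y)

# Step 5: refit the first layer on the whole training set so it can
# transform unseen data at prediction time.
for est in base_estimators:
    est.fit(X, y)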

Restacking

The stacked generalization framework is quite flexible, so we can play around with different architectures. One variation that may improve a stacked ensemble’s performance is restacking [MM2017]: we pass the training set unchanged from one layer to the next.

[Figure: a stacked ensemble with restacking]

This may improve the stacked ensemble’s performance in some cases, especially for more complicated ensembles with multiple layers. A good example that uses multiple layers and restacking in practice is Kaggle’s 2015 Dato competition winner.
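
In Wolpert, restacking is enabled through the restack argument of make_stack_layer. A short sketch (the estimators are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from wolpert import make_stack_layer

# With restack=True the layer also passes the original input features,
# along with its blended predictions, to the next layer.
layer0 = make_stack_layer(SVC(probability=True),
                          RandomForestClassifier(),
                          blending_wrapper='cv',
                          restack=True)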

References

[W1992] D. H. Wolpert, “Stacked Generalization”, Neural Networks, Vol. 5, No. 5, 1992.
[MLW2015] Hendrik Jacob van Veen, Le Nguyen The Dat, Armando Segnini. 2015. Kaggle Ensembling Guide. [accessed 2018 Jul 16]. https://mlwave.com/kaggle-ensembling-guide/
[MM2017] Michailidis, Marios (2017). Investigating machine learning methods in recommender systems. Doctoral thesis (Ph.D), UCL (University College London).

Stacking strategies

Currently wolpert supports three strategies, each with its pros and cons:

Stacking with cross validation

This strategy, implemented in the class wrappers.CVStackableTransformer, uses the predictions from cross validation to build the next data set. This means all the samples from the training set will be available to the next layer:

>>> import numpy as np
>>> from wolpert.wrappers import CVStackableTransformer
>>> from sklearn.linear_model import LogisticRegression
>>> X = np.asarray([[1, 2], [3, 4], [5, 6], [7, 8]])
>>> y = np.asarray([0, 1, 0, 1])
>>> wrapped_clf = CVStackableTransformer(LogisticRegression(random_state=1), cv=2)
>>> wrapped_clf.fit_blend(X, y)  
(array([[0.52444526, 0.47555474],
        [0.48601904, 0.51398096],
        [0.15981917, 0.84018083],
        [0.08292124, 0.91707876]]), ...)

Note

The first value returned by blend / fit_blend is the transformed training set; the second is the indexes of the rows present in the transformed data. Don’t worry about the second value for now.

As you can see, the data transformed by blending has the same number of rows as the input. For this wrapper, that’s always true. It’s good because we’ll have more data to train the subsequent layers, but it comes with a downside: as we fit the layer to the whole training set after blending, the probability distribution of the transformed data will change between train and test. Don’t worry too much, though: in practice the results are still good.

In multi-layer stackings, this may be the only viable choice: with any other strategy, the training set becomes exponentially smaller from layer to layer.

Stacking with holdout set

When the training set is too big, using a k-fold split may be too slow. For these cases, we have wrappers.HoldoutStackableTransformer. This strategy splits the data into two sets: training and holdout. The model is trained on the training split and outputs predictions for the holdout split, which means we’ll have fewer rows to train the subsequent layers. See the following example:

>>> import numpy as np
>>> from wolpert.wrappers import HoldoutStackableTransformer
>>> from sklearn.linear_model import LogisticRegression
>>> X = np.asarray([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
>>> y = np.asarray([0, 1, 0, 1, 1])
>>> wrapped_clf = HoldoutStackableTransformer(LogisticRegression(random_state=1),
...                                           holdout_size=.5)
>>> wrapped_clf.fit_blend(X, y)  
 (array([[0.34674758, 0.65325242],
         [0.0649691 , 0.9350309 ],
         [0.21229721, 0.78770279]]),
  array([1, 4, 2]))

As you can see from the indexes array, only predictions for rows 1, 2 and 4 were returned in the transformed data set. This will be faster than wrappers.CVStackableTransformer and, if fit_to_all_data is set to False, the train and test sets will come from the same probability distribution.
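
Here’s a hedged sketch of that variant, reusing the objects from the example above (fit_to_all_data is the parameter mentioned in this chapter; everything else is illustrative):

>>> # The final fit is now restricted to the training split, so train
>>> # and test outputs come from the same distribution.
>>> wrapped_clf = HoldoutStackableTransformer(LogisticRegression(random_state=1),
...                                           holdout_size=.5,
...                                           fit_to_all_data=False)
>>> Xt, indexes = wrapped_clf.fit_blend(X, y)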

Stacking with time series

When dealing with time series data, extra care must be taken to avoid leakage. wrappers.TimeSeriesStackableTransformer handles part of this issue by making splits that never violate the original ordering of the data; in other words, indexes in the training set will always be smaller than indexes in the test set for every split.

It works by walking in ascending order, growing the training set on each split and predicting on the data that comes after it. It’s almost the same as scikit-learn’s TimeSeriesSplit, but with some knobs we found more useful. Here’s an example:

>>> import numpy as np
>>> from wolpert.wrappers import TimeSeriesStackableTransformer
>>> from sklearn.linear_model import LogisticRegression
>>> X = np.asarray([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
>>> y = np.asarray([0, 1, 0, 1, 1])
>>> wrapped_clf = TimeSeriesStackableTransformer(LogisticRegression(random_state=1),
...                                              min_train_size=2)
>>> wrapped_clf.fit_blend(X, y)  
 (array([[0.15981917, 0.84018083],
         [0.74725218, 0.25274782],
         [0.26388084, 0.73611916]]),
  array([2, 3, 4]))

These were the splits used to generate the blended data set:

  1. Train on indexes 0 and 1, predict for index 2;
  2. Train on indexes 0, 1 and 2, predict for index 3;
  3. Train on indexes 0, 1, 2 and 3, predict for index 4.

This resembles leave-one-out cross validation, but wrappers.TimeSeriesStackableTransformer provides other options, so make sure to check its documentation. For example, to make a blended dataset that resembles leave-p-out cross validation, all you have to do is change the test_set_size parameter of wrappers.TimeSeriesStackableTransformer:

>>> import numpy as np
>>> from wolpert.wrappers import TimeSeriesStackableTransformer
>>> from sklearn.linear_model import LogisticRegression
>>> X = np.asarray([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
>>> y = np.asarray([0, 1, 0, 1, 1, 0])
>>> wrapped_clf = TimeSeriesStackableTransformer(LogisticRegression(random_state=1),
...                                              min_train_size=2,
...                                              test_set_size=2)
>>> wrapped_clf.fit_blend(X, y)  
 (array([[0.15981917, 0.84018083],
         [0.08292124, 0.91707876],
         [0.26388084, 0.73611916],
         [0.20981736, 0.79018264]]),
  array([2, 3, 4, 5]))

Note

When the remaining samples at the end of the series are not enough to satisfy the test set size constraint, they are dropped from the transformed data. For instance, if the example above had 7 samples, the last one couldn’t fill a test set of size 2 and would be dropped.

Using wolpert to build stacked ensembles

We’ll build a stacked ensemble for a classification task. Let’s start by loading our data:

from sklearn.datasets import make_classification
RANDOM_STATE = 888
X, y = make_classification(n_samples=1000, random_state=RANDOM_STATE)

Now let’s choose some base models to build our first layer. We’ll go with a KNN, an SVM, a random forest and extremely randomized trees, all available in scikit-learn. It’s worth noting that any scikit-learn compatible model can be used here:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

knn = KNeighborsClassifier()
svc = SVC(random_state=RANDOM_STATE, probability=True)
rf = RandomForestClassifier(random_state=RANDOM_STATE)
et = ExtraTreesClassifier(random_state=RANDOM_STATE)

Now let’s test each classifier alone and see what we get. We’ll use 3-fold cross validation and evaluate with log loss.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss

def evaluate(clf, clf_name, X, y):
    kfold = StratifiedKFold(n_splits=3, random_state=RANDOM_STATE)
    scores = []
    for train_idx, test_idx in kfold.split(X, y):
        ypreds = clf.fit(X[train_idx], y[train_idx]).predict_proba(X[test_idx])
        scores.append(log_loss(y[test_idx], ypreds))
    print("Logloss for %s: %.5f (+/- %.5f)" % (clf_name, np.mean(scores), np.std(scores)))
    return scores

evaluate(knn, "KNN classifier", X, y)
evaluate(rf, "Random Forest", X, y)
evaluate(svc, "SVM classifier", X, y)
evaluate(et, "ExtraTrees", X, y)
Logloss for KNN classifier: 0.65990 (+/- 0.10233)
Logloss for Random Forest: 0.47338 (+/- 0.21536)
Logloss for SVM classifier: 0.24082 (+/- 0.02127)
Logloss for ExtraTrees: 0.53194 (+/- 0.08710)

The best model here is the SVM. We now have a baseline to build our stacked ensemble.

The first thing to decide is the stacking strategy. The dataset is pretty small, so it’s fine to go with a cross validation strategy.

Note

To know more about the strategies implemented in wolpert, read the strategies chapter.

The easiest way to do so is with the helper function pipeline.make_stack_layer(). This function takes the list of steps used to build a layer, plus the blending wrapper.

from wolpert import make_stack_layer

layer0 = make_stack_layer(knn, rf, svc, et, blending_wrapper='cv')

Ok, now that we have our first layer, let’s put a very simple model on top of it and see how it goes. To validate the meta estimator, we must first generate the blended dataset (see pipeline.StackingLayer.fit_blend() for more info):

Xt, t_indexes = layer0.fit_blend(X, y)
yt = y[t_indexes]

Now we can build our meta estimator and evaluate it:

from sklearn.linear_model import LogisticRegression

meta = LogisticRegression(random_state=RANDOM_STATE)

evaluate(meta, "Meta estimator", Xt, yt)
Logloss for Meta estimator: 0.22706 (+/- 0.02656)

Notice the score is already better than our best classifier on the first layer. Now let’s construct the final model using the class pipeline.StackingPipeline. It acts like scikit-learn’s Pipeline: the output of each step is piped to the next one. The difference is that StackingPipeline uses blending when fitting the models to a dataset.

from wolpert import StackingPipeline

stacked_clf = StackingPipeline([("l0", layer0), ("meta", meta)])

Note

StackingPipeline has a helper method for evaluating it, called score, but it depends on scikit-learn’s cross_validate function, which doesn’t let us choose the method called on the estimator and always calls predict. Here’s an example:

stacked_clf.fit(X, y)
scores = stacked_clf.score(X, y, scoring='neg_log_loss', cv=3)

print("Logloss for Stacked classifier: %.5f (+/- %.5f)" % (-np.mean(scores["test_score"]),
                                                           np.std(scores["test_score"])))
Logloss for Stacked classifier: 0.28145 (+/- 0.03106)

Notice the score is worse than our handcrafted evaluation; that’s because score evaluates hard predictions from predict instead of the probabilities we used before.
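
A hedged workaround, not part of Wolpert’s API, is to cross validate the whole pipeline with scikit-learn’s cross_val_predict, which does let us pick the method:

from sklearn.model_selection import cross_val_predict

# Evaluate the stacked pipeline on out-of-fold probabilities instead
# of hard predictions.
proba = cross_val_predict(stacked_clf, X, y, cv=3, method='predict_proba')
print("Logloss with predict_proba: %.5f" % log_loss(y, proba))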

Now let’s see how we can improve our model.

Multi-layer stacked ensemble

Let’s try a simple approach: we’ll grab the two best models from the first layer and create a second layer with them. We’ll also use restacking on the first layer. The final meta estimator will remain the same.

layer0_clfs = [KNeighborsClassifier(),
               SVC(random_state=RANDOM_STATE, probability=True),
               RandomForestClassifier(random_state=RANDOM_STATE),
               ExtraTreesClassifier(random_state=RANDOM_STATE)]

layer1_clfs = [SVC(random_state=RANDOM_STATE, probability=True),
               RandomForestClassifier(random_state=RANDOM_STATE)]

layer0 = make_stack_layer(*layer0_clfs, blending_wrapper="cv", restack=True)
layer1 = make_stack_layer(*layer1_clfs, blending_wrapper="cv")

# first let's build the pipeline without the final estimator to see its
# performance
transformer = StackingPipeline([("layer0", layer0), ("layer1", layer1)])
Xt, t_indexes = transformer.fit_blend(X, y)

evaluate(meta, "Meta classifier with two layers", Xt, y[t_indexes])
Logloss for Meta classifier with two layers: 0.28145 (+/- 0.03106)

Well, it didn’t help, so let’s keep the old model. There are possible reasons for this: maybe our model is too complex for the dataset, and a single layer generalizes better.

Model selection

We can access all attributes of all estimators just like in a regular scikit-learn pipeline, so we can follow the usual steps for model selection:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "l0__svc__method": ["predict_proba", "decision_function"],
    "l0__svc__estimator__C": [.1, 1., 10]}

clf_cv = GridSearchCV(stacked_clf, param_grid, scoring="neg_log_loss", n_jobs=-1)
clf_cv.fit(X, y)
test_scores = clf_cv.cv_results_["mean_test_score"]
print("Logloss for best model on CV: %.5f (+/- %.5f)" % (-test_scores.mean(), test_scores.std()) )
Logloss for best model on CV: 0.22491 (+/- 0.00259)

Note

Remember that this score should be compared to the one from the score method.
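
To inspect the winning configuration, GridSearchCV’s standard attributes are available (a small usage note, not specific to wolpert):

# The best parameter combination and its cross validated score.
print(clf_cv.best_params_)
print("Best CV logloss: %.5f" % -clf_cv.best_score_)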

Wrappers API

Up until now we relied on the default arguments for wrapping our models. To have more control over these arguments, we can use the wolpert.wrappers API. Let’s build our model again, now using a 10-fold cross validation. For this, we’ll use the CVWrapper helper class.

from wolpert.wrappers import CVWrapper

cv_wrapper = CVWrapper(cv=10, n_cv_jobs=-1)

The main method of this class is wrap_estimator, which receives an estimator and returns it wrapped in a class that exposes the methods blend and fit_blend. We can also pass the wrapper as the blending_wrapper argument of wolpert.make_stack_layer, and it will be used to wrap all the estimators on the layer:
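
For instance, wrapping a single estimator directly (a minimal sketch, reusing the classifiers defined earlier):

# wrap_estimator returns the estimator wrapped with this wrapper's
# settings (here, a 10-fold CVStackableTransformer).
wrapped_svc = cv_wrapper.wrap_estimator(svc)
Xt_svc, svc_indexes = wrapped_svc.fit_blend(X, y)

And passing the wrapper to a whole layer: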

layer0 = make_stack_layer(knn, rf, svc, et, blending_wrapper=cv_wrapper)
stacked_clf = StackingPipeline([("l0", layer0), ("meta", meta)])

Just out of curiosity, here’s the model performance:

Xt, t_indexes = layer0.fit_blend(X, y)

evaluate(meta, "Meta classifier with CV=10 on first layer", Xt, y[t_indexes])
Logloss for Meta classifier with CV=10 on first layer: 0.22241 (+/- 0.03292)

Inner estimators performance

Sometimes it’s useful to keep track of the performance of each estimator inside an ensemble. To do so, every wrapper exposes a parameter called scoring. It works similarly to scikit-learn’s scoring parameter, but it uses the metric functions directly instead of a scorer. We do so to avoid retraining models inside an ensemble, which is already an expensive computation.

When scoring is set, every time a blend happens the scoring results are stored in the scores_ attribute. It’s a list of dicts where each key is the name of the score used. If a name is not supplied, it will be score with an integer suffix.

Each metric may be a string (for the builtin metrics) or a function that receives the true labels and the predicted labels and returns a single floating point number denoting the score for this pair. The scoring parameter accepts a single metric, a list of metrics or a dict where each key is the metric name and the value is the metric itself.

import numpy as np

from wolpert.wrappers import CVStackableTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.asarray([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.asarray([0, 1, 0, 1])

# With a single metric
cvs = CVStackableTransformer(
    LinearRegression(), scoring='mean_absolute_error')
cvs.fit_blend(X, y)
print(cvs.scores_)

# a list of metrics
cvs = CVStackableTransformer(
    LinearRegression(), scoring=['mean_absolute_error',
                                 mean_squared_error])
cvs.fit_blend(X, y)
print(cvs.scores_)

# a dict of metrics
cvs = CVStackableTransformer(
    LinearRegression(), scoring={'mae': 'mean_absolute_error',
                                 'mse': mean_squared_error})
cvs.fit_blend(X, y)
import pprint
pprint.pprint(cvs.scores_)
[{'score': 1.380...}]
[{'score': 1.380..., 'score1': 2.294...}]
[{'mae': 1.380..., 'mse': 2.294...}]

We can also use the verbose parameter to keep track of the models’ performance. It will print the results to stdout:

cvs = CVStackableTransformer(
    LinearRegression(), scoring='mean_absolute_error', verbose=True)
cvs.fit_blend(X, y)
[BLEND] cv=3, estimator=<class
        'sklearn.linear_model.base.LinearRegression'>,
        estimator__copy_X=True, estimator__fit_intercept=True,
        estimator__n_jobs=1, estimator__normalize=False, method=auto,
        n_cv_jobs=1, scoring=mean_absolute_error, verbose=True
 - scores 0: score=1.380...


API documentation

wolpert.pipeline: Pipeline classes and utility functions

Classes

  • StackingLayer(transformer_list[, n_jobs, …]): Creates a single layer for the stacked ensemble.
  • StackingPipeline(steps[, memory]): A pipeline of StackingLayers with a final estimator.

Functions

  • make_stack_layer(*estimators, **kwargs): Creates a single stack layer to be used in a stacked ensemble.

wolpert.wrappers: Wrappers

Classes

  • HoldoutWrapper([default_method, …]): Helper class to wrap estimators with HoldoutStackableTransformer.
  • CVWrapper([default_method, default_scoring, …]): Helper class to wrap estimators with CVStackableTransformer.
  • TimeSeriesWrapper([default_method, …]): Helper class to wrap estimators with TimeSeriesStackableTransformer.
  • HoldoutStackableTransformer(estimator[, …]): Transformer to turn estimators into meta-estimators for model stacking.
  • CVStackableTransformer(estimator[, method, …]): Transformer to turn estimators into meta-estimators for model stacking.
  • TimeSeriesStackableTransformer(estimator[, …]): Transformer to turn estimators into meta-estimators for model stacking.
