Reskit: Researcher Kit for Reproducible Machine Learning Experiments.

Overview

Reskit (researcher’s kit) is a library for creating and curating reproducible pipelines for scientific and industrial machine learning. The natural extension of the scikit-learn Pipelines to general classes of pipelines, Reskit allows for the efficient and transparent optimization of each pipeline step. Main features include data caching, compatibility with most of the scikit-learn objects, optimization constraints (e.g. forbidden combinations), and table generation for quality metrics. Reskit also allows for the injection of custom metrics into the underlying scikit frameworks. Reskit is intended for use by researchers who need pipelines amenable to versioning and reproducibility, yet who also have a large volume of experiments to run.

Features

  • Combining pipelines with an equal number of steps into a list of experiments, running them, and returning the results in a format convenient for analysis (a pandas dataframe).
  • Caching of preprocessing steps. Usual scikit-learn pipelines cannot cache intermediate steps. Reskit can save the output of fixed steps, so steps that were already calculated are not recalculated in the next pipeline.
  • Setting “forbidden combinations” for chosen steps of a pipeline, so that only the pipelines you need are tested rather than all possible combinations.
  • Full compatibility with scikit-learn objects: any scikit-learn data transformer or predictive model can be used in Reskit.
  • Evaluating experiments with several performance metrics.
  • Creating transformers for your own tasks through the DataTransformer class, which lets you use your own functions as data processing steps in pipelines.
  • Tools for machine learning on networks, in particular for connectomics. For example, you can normalize adjacency matrices of graphs and calculate state-of-the-art local metrics using DataTransformer and BCTpy (the Python version of the Brain Connectivity Toolbox), or use the metrics implemented in Reskit [3].

Getting started: A Short Introduction to Reskit

Let’s say we want to prepare data and try some scalers and classifiers on a classification problem. We will tune the parameters of the classifiers with grid search.

Preparing the data:

from sklearn.datasets import make_classification


X, y = make_classification()

Setting steps for our pipelines and parameters for grid search:

from reskit.core import Pipeliner

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


classifiers = [('LR', LogisticRegression()),
               ('SVC', SVC())]

scalers = [('standard', StandardScaler()),
           ('minmax', MinMaxScaler())]

steps = [('scaler', scalers),
         ('classifier', classifiers)]

param_grid = {'LR': {'penalty': ['l1', 'l2']},
              'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}}

Setting up cross-validation for the grid search over hyperparameters and for evaluating the models with the selected hyperparameters:

from sklearn.model_selection import StratifiedKFold


grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

Creating a plan of our research:

pipeliner = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipeliner.plan_table
  scaler classifier
0 standard LR
1 standard SVC
2 minmax LR
3 minmax SVC

To tune the parameters of the models and evaluate them, run:

pipeliner.get_results(X, y, scoring=['roc_auc'])
Line: 1/4
Line: 2/4
Line: 3/4
Line: 4/4
  scaler classifier grid_roc_auc_mean grid_roc_auc_std grid_roc_auc_best_params eval_roc_auc_mean eval_roc_auc_std eval_roc_auc_scores
0 standard LR 0.956 0.0338230690506 {'penalty': 'l1'} 0.968 0.0324961536185 [ 0.92 1. 1. 0.94 0.98]
1 standard SVC 0.962 0.0278567765544 {'kernel': 'poly'} 0.976 0.0300665927567 [ 0.95 1. 1. 0.93 1. ]
2 minmax LR 0.964 0.0412795348811 {'penalty': 'l1'} 0.966 0.0377359245282 [ 0.92 1. 1. 0.92 0.99]
3 minmax SVC 0.958 0.0411825205639 {'kernel': 'rbf'} 0.962 0.0401995024845 [ 0.93 1. 1. 0.9 0.98]
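
Since the results come back as a pandas dataframe, picking a winner is a matter of sorting on one of the evaluation columns. The sketch below uses plain Python dictionaries instead of the dataframe, filled with the eval_roc_auc_mean values from the table above, so that it runs without Reskit. The names results and best are illustrative only.

```python
# Rows copied from the results table above (evaluation means only).
results = [
    {"scaler": "standard", "classifier": "LR", "eval_roc_auc_mean": 0.968},
    {"scaler": "standard", "classifier": "SVC", "eval_roc_auc_mean": 0.976},
    {"scaler": "minmax", "classifier": "LR", "eval_roc_auc_mean": 0.966},
    {"scaler": "minmax", "classifier": "SVC", "eval_roc_auc_mean": 0.962},
]

# Pick the pipeline with the highest mean ROC AUC on the evaluation folds.
best = max(results, key=lambda row: row["eval_roc_auc_mean"])
print(best["scaler"], best["classifier"], best["eval_roc_auc_mean"])
# standard SVC 0.976
```

With the actual dataframe, calling sort_values('eval_roc_auc_mean', ascending=False) on it should give the same ordering.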

Installation

Reskit currently requires Python 3.4 or later to run. Please install Python and pip via the package manager of your operating system if it is not included already.

The dependencies of Reskit are listed in requirements.txt. To install them, run the following command:

pip install -r https://raw.githubusercontent.com/neuro-ml/reskit/master/requirements.txt

To install the stable version, run the following command:

pip install -U https://github.com/neuro-ml/reskit/archive/master.zip

To install the latest development version of Reskit, run the following command:

pip install https://github.com/neuro-ml/reskit/archive/master.zip

Some Reskit functions have additional dependencies, listed in requirements_additional.txt. You may install them via:

pip install -r https://raw.githubusercontent.com/neuro-ml/reskit/master/requirements_additional.txt

Docker

If you just want to try Reskit or do not want to install Python, you can build a Docker image and run everything Reskit-related inside it. This also gives others a simple way to reproduce your experiments. To run Reskit in Docker, use the following commands.

  1. Clone:
git clone https://github.com/neuro-ml/reskit.git
cd reskit
  2. Build:
docker build -t docker-reskit -f Dockerfile .
  3. Run a container:
    a. To run bash in the container:
docker run -it docker-reskit bash
    b. To run bash in the container with a shared directory:
docker run -v $PWD/scripts:/reskit/scripts -it -p 8809:8888 docker-reskit bash

Note

Files that you save in the shared directory are not deleted when the container stops.

    c. To start a Jupyter Notebook server in the container, reachable at http://localhost:8809:
docker run -v $PWD/scripts:/reskit/scripts -it -p 8809:8888 docker-reskit jupyter notebook --no-browser --ip="*"

Open http://localhost:8809 on your local machine in a web browser.

Tutorial

A central task in machine learning and data science is the comparison and selection of models. The evaluation of a single model is very simple, and can be carried out in a reproducible fashion using the standard scikit pipeline. Organizing the evaluation of a large number of models is tricky; while there are no real theory problems present, the logistics and coordination can be tedious. Evaluating a continuously growing zoo of models is thus an even more painful task. Unfortunately, this last case is also quite common.

Reskit is a Python library that helps researchers manage this problem. Specifically, it automates the process of choosing the best pipeline, i.e. choosing the best set of data transformations and classifiers/regressors. The core of reskit is two classes: Pipeliner and Transformer.

The first and second sections describe how these classes work. The third section explains how to use them for machine learning on graphs.

All tutorials are also available as Jupyter notebooks.

Pipeliner Class Usage

The task is simple: find the best combination of pre-processing steps and predictive models with respect to an objective criterion. Logistically this can be problematic: a small example might involve three classification models, and two data preprocessing steps with two possible variations for each — overall 12 combinations. For each of these combinations we would like to perform a grid search of predefined hyperparameters on a fixed cross-validation dataset, computing performance metrics for each option (for example ROC AUC). Clearly this can become complicated quickly. On the other hand, many of these combinations share substeps, and re-running such shared steps amounts to a loss of compute time.
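
The count of 12 is simply the product of the number of variants per step. As a minimal sketch (plain itertools, not Reskit code), using the step names from the example in this tutorial:

```python
from itertools import product

# Two preprocessing steps with two variants each, and three models.
feature_engineering = ["VT", "PCA"]
scalers = ["standard", "minmax"]
classifiers = ["LR", "SVC", "SGD"]

# Every pipeline is one choice per step.
plan = list(product(feature_engineering, scalers, classifiers))
print(len(plan))  # 2 * 2 * 3 = 12
for row in plan[:3]:
    print(row)
```

Adding one more variant to any step multiplies the number of pipelines, which is why the bookkeeping grows quickly.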

1. Defining Pipelines Steps and Grid Search Parameters

The researcher specifies the possible processing steps and the scikit objects involved, then Reskit expands these steps to each possible pipeline. Reskit represents these pipelines in a convenient pandas dataframe, so the researcher can directly visualize and manipulate the experiments.

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

from sklearn.model_selection import StratifiedKFold

from reskit.core import Pipeliner

# Feature selection and feature extraction step variants (1st step)
feature_engineering = [('VT', VarianceThreshold()),
                       ('PCA', PCA())]

# Preprocessing step variants (2nd step)
scalers = [('standard', StandardScaler()),
           ('minmax', MinMaxScaler())]

# Models (3rd step)
classifiers = [('LR', LogisticRegression()),
               ('SVC', SVC()),
               ('SGD', SGDClassifier())]

# Reskit expects the steps to be defined in this form
steps = [('feature_engineering', feature_engineering),
         ('scaler', scalers),
         ('classifier', classifiers)]

# Grid search parameters for our models
param_grid = {'LR': {'penalty': ['l1', 'l2']},
              'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']},
              'SGD': {'penalty': ['elasticnet'],
                      'l1_ratio': [0.1, 0.2, 0.3]}}

# Quality metric that we want to optimize
scoring = 'roc_auc'

# Setting cross-validations
grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipe.plan_table
  feature_engineering scaler classifier
0 VT standard LR
1 VT standard SVC
2 VT standard SGD
3 VT minmax LR
4 VT minmax SVC
5 VT minmax SGD
6 PCA standard LR
7 PCA standard SVC
8 PCA standard SGD
9 PCA minmax LR
10 PCA minmax SVC
11 PCA minmax SGD

2. Forbidden combinations

If you do not want to use the minmax scaler with SVC, you can define a banned combination:

banned_combos = [('minmax', 'SVC')]
pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid, banned_combos=banned_combos)
pipe.plan_table
  feature_engineering scaler classifier
0 VT standard LR
1 VT standard SVC
2 VT standard SGD
3 VT minmax LR
4 VT minmax SGD
5 PCA standard LR
6 PCA standard SVC
7 PCA standard SGD
8 PCA minmax LR
9 PCA minmax SGD
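
The effect of banned_combos on the plan can be reproduced with a few lines of plain Python. This is only a sketch of the behaviour visible in the two tables above, assuming a row is dropped when it contains every name of a banned combination; it is not taken from the Reskit implementation, and the helper is_banned is hypothetical.

```python
from itertools import product

steps = [
    ("feature_engineering", ["VT", "PCA"]),
    ("scaler", ["standard", "minmax"]),
    ("classifier", ["LR", "SVC", "SGD"]),
]
banned_combos = [("minmax", "SVC")]


def is_banned(row, banned):
    # A row is dropped when it contains every name of some banned combo.
    return any(set(combo) <= set(row) for combo in banned)


all_rows = product(*(variants for _, variants in steps))
plan = [row for row in all_rows if not is_banned(row, banned_combos)]
print(len(plan))  # 12 - 2 = 10
```

The two removed rows are the VT and PCA pipelines that pair minmax with SVC, which matches the table above.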

3. Launching Experiment

Reskit then runs each experiment and returns the results as a pandas dataframe. For each pipeline, Reskit runs a grid search with cross-validation to find the best classifier parameters, then reports the mean and standard deviation of the metric (ROC AUC in this case) for each tested pipeline.

from sklearn.datasets import make_classification


X, y = make_classification()
pipe.get_results(X, y, scoring=['roc_auc'])
Line: 1/10
Line: 2/10
Line: 3/10
Line: 4/10
Line: 5/10
Line: 6/10
Line: 7/10
Line: 8/10
Line: 9/10
Line: 10/10
  feature_engineering scaler classifier grid_roc_auc_mean grid_roc_auc_std grid_roc_auc_best_params eval_roc_auc_mean eval_roc_auc_std eval_roc_auc_scores
0 VT standard LR 0.98 0.0109544511501 {'penalty': 'l1'} 0.978 0.024 [ 0.99 1. 1. 0.96 0.94]
1 VT standard SVC 0.97 0.0289827534924 {'kernel': 'sigmoid'} 0.972 0.036551333765 [ 1. 1. 1. 0.95 0.91]
2 VT standard SGD 0.968 0.0203960780544 {'l1_ratio': 0.3, 'penalty': 'elasticnet'} 0.958 0.0213541565041 [ 0.98 0.92 0.97 0.97 0.95]
3 VT minmax LR 0.98 0.0141421356237 {'penalty': 'l1'} 0.978 0.0203960780544 [ 0.96 1. 1. 0.98 0.95]
4 VT minmax SGD 0.968 0.0193907194297 {'l1_ratio': 0.2, 'penalty': 'elasticnet'} 0.966 0.0422374241639 [ 0.99 1. 1. 0.95 0.89]
5 PCA standard LR 0.978 0.0116619037897 {'penalty': 'l1'} 0.982 0.0193907194297 [ 1. 1. 0.99 0.95 0.97]
6 PCA standard SVC 0.958 0.0263818119165 {'kernel': 'sigmoid'} 0.956 0.054258639865 [ 1. 1. 1. 0.88 0.9 ]
7 PCA standard SGD 0.918 0.0426145515053 {'l1_ratio': 0.3, 'penalty': 'elasticnet'} 0.94 0.0433589667774 [ 0.98 0.96 0.97 0.86 0.93]
8 PCA minmax LR 0.97 0.0352136337233 {'penalty': 'l2'} 0.936 0.0705974503789 [ 1. 1. 0.97 0.82 0.89]
9 PCA minmax SGD 0.946 0.032 {'l1_ratio': 0.1, 'penalty': 'elasticnet'} 0.934 0.0697423830967 [ 1. 1. 0.97 0.84 0.86]

4. Caching intermediate steps

Reskit also allows you to cache interim calculations to avoid unnecessary recalculations.

from sklearn.preprocessing import Binarizer

# Simple binarization step that we want to cache
binarizer = [('binarizer', Binarizer())]

# Reskit expects the steps to be defined in this form
steps = [('binarizer', binarizer),
         ('classifier', classifiers)]

pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipe.plan_table
  binarizer classifier
0 binarizer LR
1 binarizer SVC
2 binarizer SGD
pipe.get_results(X, y, caching_steps=['binarizer'])
Line: 1/3
Line: 2/3
Line: 3/3
  binarizer classifier grid_accuracy_mean grid_accuracy_std grid_accuracy_best_params eval_accuracy_mean eval_accuracy_std eval_accuracy_scores
0 binarizer LR 0.92 0.0244948974278 {'penalty': 'l1'} 0.92 0.0244948974278 [ 0.95 0.9 0.95 0.9 0.9 ]
1 binarizer SVC 0.92 0.0244948974278 {'kernel': 'rbf'} 0.92 0.0244948974278 [ 0.95 0.9 0.95 0.9 0.9 ]
2 binarizer SGD 0.85 0.0894427191 {'l1_ratio': 0.2, 'penalty': 'elasticnet'} 0.82 0.0812403840464 [ 0.9 0.85 0.9 0.75 0.7 ]

The most recent cached calculations are stored in _cached_X:

pipe._cached_X
OrderedDict([('init',
              array([[-0.34004591,  0.07223225, -0.10297704, ...,  1.55809216,
                      -1.84967225,  1.20716726],
                     [-0.61534739, -0.2666859 , -1.21834152, ..., -1.31814689,
                       0.97544639, -1.21321157],
                     [ 1.08934663,  0.12345205,  0.09360395, ..., -0.50379748,
                      -0.03416718,  1.51609726],
                     ...,
                     [-1.06428161, -0.22220536, -2.87462458, ..., -0.17236827,
                      -0.22141068,  2.76238087],
                     [ 0.40555432,  0.12063241,  1.1565546 , ...,  1.71135941,
                       0.29149897, -0.67978708],
                     [-0.47521282,  0.11614697,  0.45649735, ..., -0.15355913,
                       0.19643313,  0.67876913]])),
             ('binarizer', array([[ 0.,  1.,  0., ...,  1.,  0.,  1.],
                     [ 0.,  0.,  0., ...,  0.,  1.,  0.],
                     [ 1.,  1.,  1., ...,  0.,  0.,  1.],
                     ...,
                     [ 0.,  0.,  0., ...,  0.,  0.,  1.],
                     [ 1.,  1.,  1., ...,  1.,  1.,  0.],
                     [ 0.,  1.,  1., ...,  0.,  1.,  1.]]))])
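
The idea behind caching can be illustrated without Reskit. The sketch below is an assumption about the general technique (memoizing a step's output in a dictionary keyed by step name), not a description of how Pipeliner implements it. The function run_pipeline and the counter calls are hypothetical, and binarize mimics the default behaviour of scikit-learn's Binarizer (values above the threshold become 1, others 0).

```python
from collections import OrderedDict

calls = {"binarize": 0}


def binarize(X, threshold=0.0):
    # Count invocations so we can see how often the step really runs.
    calls["binarize"] += 1
    return [[1.0 if v > threshold else 0.0 for v in row] for row in X]


def run_pipeline(X, cache):
    # Reuse the transformed data if an earlier pipeline already computed it.
    if "binarizer" not in cache:
        cache["init"] = X
        cache["binarizer"] = binarize(X)
    return cache["binarizer"]


X = [[-0.3, 0.07], [1.08, -0.5]]
cache = OrderedDict()
for classifier in ["LR", "SVC", "SGD"]:
    Xt = run_pipeline(X, cache)  # the classifier would be fitted on Xt here

print(calls["binarize"])  # 1: the step ran once for three pipelines
print(Xt)                 # [[0.0, 1.0], [1.0, 0.0]]
```

The cache keys mirror the 'init' and 'binarizer' entries shown in _cached_X above.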

Transformers Usage

This tutorial shows how to transform your data using the DataTransformer and MatrixTransformer classes, and how to write your own classes for data transformation.

1. MatrixTransformer

import numpy as np

from reskit.normalizations import mean_norm
from reskit.core import MatrixTransformer

matrix_0 = np.random.rand(5, 5)
matrix_1 = np.random.rand(5, 5)
matrix_2 = np.random.rand(5, 5)
y = np.array([0, 0, 1])

X = np.array([matrix_0,
              matrix_1,
              matrix_2])

output = np.array([mean_norm(matrix_0),
                   mean_norm(matrix_1),
                   mean_norm(matrix_2)])

result = MatrixTransformer(
            func=mean_norm).fit_transform(X)

(output == result).all()
True

This is a simple example of MatrixTransformer usage. The input X for MatrixTransformer should be a 3-dimensional array (an array of matrices); MatrixTransformer simply applies the function to each matrix in X.

If your data has a specific structure, it is convenient to write your own function for processing it.
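
To make the per-matrix behaviour concrete, here is a minimal stand-in written in plain Python. It is a sketch of the pattern only: the class PerMatrixTransformer and the toy function scale are hypothetical, and the forwarding of extra keyword arguments to func is an assumption based on the MatrixTransformer(func, **params) signature, not a copy of the Reskit source.

```python
class PerMatrixTransformer:
    """Hypothetical stand-in: applies func(matrix, **params) to each matrix."""

    def __init__(self, func, **params):
        self.func = func
        self.params = params

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        return [self.func(matrix, **self.params) for matrix in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


def scale(matrix, factor=1.0):
    # Toy per-matrix function; `factor` is forwarded from the constructor.
    return [[value * factor for value in row] for row in matrix]


X = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
result = PerMatrixTransformer(func=scale, factor=2.0).fit_transform(X)
print(result[0])  # [[2.0, 4.0], [6.0, 8.0]]
```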

2. DataTransformer

To make writing new transformers simple, we provide DataTransformer. The main idea is to write a function that takes some X and returns the transformed X, so you do not need to write a transformation class for compatibility with sklearn pipelines. Here is an example of DataTransformer usage:

from reskit.core import DataTransformer


def mean_norm_trans(X):
    X = X.copy()
    N = len(X)
    for i in range(N):
        X[i] = mean_norm(X[i])
    return X

result = DataTransformer(
            func=mean_norm_trans).fit_transform(X)

(output == result).all()
True

As you can see, we wrote the same transformation, but with DataTransformer instead of MatrixTransformer.

3. Your own transformer

If you need more flexibility in transformation, you can implement your own transformer. The simplest example:

from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator

class MyTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None, **fit_params):
        #
        # Write code here if the transformer needs
        # to learn anything from the data.
        #
        # Usually nothing is needed here;
        # just return self.
        #
        return self

    def transform(self, X):
        #
        # Write here your transformation
        #
        return X
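
The template above is a pass-through. As an illustration of a transformer that does learn something in fit, the sketch below centers columns by the means seen during fitting. It is written as a plain class so that it runs without scikit-learn; in a real pipeline you would inherit from BaseEstimator and TransformerMixin as in the template, and TransformerMixin would supply fit_transform. The class name ColumnCenterer is hypothetical.

```python
class ColumnCenterer:
    """Toy transformer: learns column means in fit, subtracts them in transform."""

    def fit(self, X, y=None, **fit_params):
        n_rows = len(X)
        # Learned state is stored with a trailing underscore, as in scikit-learn.
        self.means_ = [sum(column) / n_rows for column in zip(*X)]
        return self

    def transform(self, X):
        return [[value - mean for value, mean in zip(row, self.means_)]
                for row in X]

    def fit_transform(self, X, y=None, **fit_params):
        # TransformerMixin would normally provide this method.
        return self.fit(X, y, **fit_params).transform(X)


train = [[1.0, 10.0], [3.0, 30.0]]
centerer = ColumnCenterer().fit(train)
print(centerer.means_)                    # [2.0, 20.0]
print(centerer.transform([[2.0, 25.0]]))  # [[0.0, 5.0]]
```

Keeping the learned means in fit and applying them in transform ensures that test data is centered with statistics from the training data only.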

Machine Learning on Graphs

We already used some graph metrics in the previous tutorial. Here we cover graph metrics and features in detail, as well as usage of the Brain Connectivity Toolbox.

1. Real-world dataset

Here we use the UCLA autism dataset, publicly available at the UCLA Multimodal Connectivity Database. The data includes DTI-based connectivity matrices of 51 high-functioning ASD subjects (6 females) and 43 TD subjects (7 females).

from reskit.datasets import load_UCLA_data


X, y = load_UCLA_data()
X = X['matrices']

2. Normalizations and Graph Metrics

We can normalize the matrices and build some features from them.

from reskit.normalizations import mean_norm
from reskit.features import bag_of_edges
from reskit.core import MatrixTransformer


normalized_X = MatrixTransformer(
    func=mean_norm).fit_transform(X)

featured_X = MatrixTransformer(
    func=bag_of_edges).fit_transform(normalized_X)
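
For intuition, a "bag of edges" representation can be thought of as flattening the edge weights of a graph into a feature vector. The sketch below takes the upper triangle of a symmetric adjacency matrix. Whether bag_of_edges in Reskit does exactly this is an assumption not confirmed by this text, and the function upper_triangle_edges and its include_diagonal parameter are hypothetical.

```python
def upper_triangle_edges(matrix, include_diagonal=False):
    """Flatten the upper triangle of a square adjacency matrix into a vector."""
    n = len(matrix)
    start = 0 if include_diagonal else 1
    return [matrix[i][j] for i in range(n) for j in range(i + start, n)]


# Symmetric weighted graph on three nodes.
adjacency = [[0.0, 0.5, 0.2],
             [0.5, 0.0, 0.9],
             [0.2, 0.9, 0.0]]
print(upper_triangle_edges(adjacency))  # [0.5, 0.2, 0.9]
```

For an undirected graph the lower triangle repeats the same weights, so keeping only one triangle avoids duplicated features.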

3. Brain Connectivity Toolbox

We provide some basic graph metrics in Reskit. To access most state-of-the-art graph metrics, you can use the Brain Connectivity Toolbox, which can be installed via pip:

sudo pip install bctpy

Let’s calculate pagerank centrality of a random graph using BCT python library.

from bct.algorithms.centrality import pagerank_centrality
import numpy as np


pagerank_centrality(np.random.rand(3,3), d=0.85)
array([ 0.46722034,  0.33387522,  0.19890444])

Now we calculate this metric for the UCLA dataset. d is the pagerank_centrality parameter called the damping factor (see the bctpy documentation for more information).

featured_X = MatrixTransformer(
    d=0.85,
    func=pagerank_centrality).fit_transform(X)
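
To show what the damping factor d does, here is a textbook power-iteration PageRank in plain Python. It is a sketch for intuition, not bctpy's implementation; the function name pagerank and the parameters tol and max_iter are illustrative, and the graph is assumed to have no isolated nodes. Each node receives a baseline share (1 - d) / N plus a fraction d of the rank flowing in from its neighbours.

```python
def pagerank(adjacency, d=0.85, tol=1e-12, max_iter=1000):
    """Power-iteration PageRank for a weighted graph with no isolated nodes."""
    n = len(adjacency)
    # Total weight leaving each node j (column sums).
    out_weight = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = [
            (1.0 - d) / n
            + d * sum(adjacency[i][j] * rank[j] / out_weight[j] for j in range(n))
            for i in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Fully connected triangle: all nodes are equivalent.
triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
ranks = pagerank(triangle, d=0.85)
print(ranks)  # each value is close to 1/3
```

With d close to 1 the ranks follow the graph structure more strongly; with d close to 0 they approach the uniform value 1 / N.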

Suppose we want to try pagerank_centrality and degrees as features for the SVM and LogisticRegression classifiers:

from bct.algorithms.degree import degrees_und

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

from reskit.core import Pipeliner

# Feature extraction step variants (1st step)
featurizers = [('pagerank', MatrixTransformer(
                                d=0.85,
                                func=pagerank_centrality)),
               ('degrees', MatrixTransformer(
                                func=degrees_und))]

# Models (2nd step)
classifiers = [('LR', LogisticRegression()),
               ('SVC', SVC())]

# Reskit expects the steps to be defined in this form
steps = [('featurizer', featurizers),
         ('classifier', classifiers)]

# Grid search parameters for our models
param_grid = {'LR': {'penalty': ['l1', 'l2']},
              'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}}

# Quality metric that we want to optimize
scoring = 'roc_auc'

# Setting cross-validations
grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

pipe = Pipeliner(steps=steps, grid_cv=grid_cv, eval_cv=eval_cv, param_grid=param_grid)
pipe.plan_table
  featurizer classifier
0 pagerank LR
1 pagerank SVC
2 degrees LR
3 degrees SVC
pipe.get_results(X, y, scoring=scoring, caching_steps=['featurizer'])
Line: 1/4
Line: 2/4
Line: 3/4
Line: 4/4
  featurizer classifier grid_roc_auc_mean grid_roc_auc_std grid_roc_auc_best_params eval_roc_auc_mean eval_roc_auc_std eval_roc_auc_scores
0 pagerank LR 0.584141951429 0.0942090541588 {'penalty': 'l2'} 0.639191919192 0.0805917875518 [ 0.5959596 0.76666667 0.63333333 0.675 0.525 ]
1 pagerank SVC 0.605372877713 0.144537957686 {'kernel': 'linear'} 0.611919191919 0.104864911084 [ 0.62626263 0.75555556 0.57777778 0.6625 0.4375 ]
2 degrees LR 0.622343111971 0.0883996599293 {'penalty': 'l1'} 0.567676767677 0.0721669280455 [ 0.61616162 0.55555556 0.46666667 0.675 0.525 ]
3 degrees SVC 0.572662798195 0.0409233652853 {'kernel': 'poly'} 0.542752525253 0.0751127269022 [ 0.62626263 0.5 0.5 0.6375 0.45 ]

These are the main points of machine learning on graphs. Now you can try a large number of normalizations, features, and classifiers for graph classification. If you need something specific, you can implement a temporary pipeline step to figure out the influence of this step on the result.

Reference

Core

Core classes.

Pipeliner(steps, grid_cv, eval_cv[, ...]) An object which allows you to test different data preprocessing pipelines and prediction models at once.
MatrixTransformer(func, **params) Helps to add your own transformation through ordinary functions.
DataTransformer(func, **params) Helps to add your own transformation through ordinary functions.

Norms

Functions of norms.

binar_norm(X) Binary matrix normalization.
max_norm(X) Maximum matrix normalization.
mean_norm(X) Mean matrix normalization.
spectral_norm(X) Spectral matrix normalization.
rwalk_norm(X) Random walk matrix normalization.
double_norm(function, X1, X2) Double normalization.
sqrtw(X) Square root matrix normalization.
invdist(X, dist) Inverse distance matrix normalization.
rootwbydist(X, dist) Root weight by distance matrix normalization.
wbysqdist(X, dist) Weights by squared distance matrix normalization.

Features

closeness_centrality(X) Closeness centrality graph metric.
betweenness_centrality(X) Betweenness centrality graph metric.
eigenvector_centrality(X) Eigenvector centrality graph metric.
pagerank(X) Pagerank graph metric.
clustering_coefficient(X) Clustering coefficient graph metric.
triangles(X) Triangles graph metric.
degrees(X) Degree graph metric.
efficiency(X) Efficiency graph metric.
bag_of_edges(X[, SPL, symmetric, return_df, ...]) Bag of edges graph metric.
