Welcome to AIkit’s documentation!

aikit stands for Artificial Intelligence tool Kit; it provides methods to facilitate and accelerate the data scientist’s job.

The aim is to provide tools that ease the repetitive parts of the data scientist’s job so that he/she can focus on modelling. This package is still in alpha and more features will be added.

This library is intended for users who know machine learning and Python’s data-science environment (sklearn, numpy, pandas, …) but don’t want to spend too much time on Python technicalities, so that they can focus on modelling. The idea is to automate, or at least accelerate, parts of the data scientist’s job so that he/she can focus on what he/she does best: the more time spent on coding, the less time spent asking the right questions and solving the actual problems. It will also help to really use models in production and not just play with them.

The library is useful if you have ever asked yourself this type of question:
  • How do I handle different types of data?
  • How do I concatenate a sparse array and a DataFrame?
  • How can I retrieve the names of my features now that everything is a numpy array?
  • I’d like to use sklearn but my data is in a DataFrame with string objects; do I really need two separate transformers just to encode the categorical features?
  • How do I deal with data that mixes text, numerical and categorical features?
  • How can I quickly test models to see what works and what doesn’t?
Here is a quick summary of what is provided:
  • additional sklearn-like transformers to facilitate common operations (category encoding, missing value handling, text encoding, …) : Transformer
  • an extension of the sklearn Pipeline that handles generic compositions of transformations : GraphPipeline
  • a framework to automatically test machine learning models : Auto ML Overview
  • helper functions to accelerate day-to-day work

Installation

Using pip:

pip install aikit

In order to use the full functionality of aikit you can also install additional packages :

  • graphviz : to have a nice representation of the graph of models
  • lightgbm : to use lightgbm in the auto-ml
  • nltk and gensim : to have the advanced text encoders
  • the nltk corpora : to clean text

To install everything you can do the following:

pip install lightgbm
pip install gensim
pip install nltk
python -m nltk.downloader punkt
python -m nltk.downloader stopwords
conda install graphviz

Getting Started

This notebook will show you how to build a complex pipeline using aikit and how to cross-validate it

[3]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)
Xtrain.head(10)
[3]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
5 3 Waelens, Mr. Achille male 22.0 0 0 345767 9.0000 NaN S NaN NaN Antwerp, Belgium / Stanton, OH
6 3 Reed, Mr. James George male NaN 0 0 362316 7.2500 NaN S NaN NaN NaN
7 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S 8 NaN Brooklyn, NY
8 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 1 0 13695 60.0000 C31 S 6 NaN Huntington, WV
9 1 Rowe, Mr. Alfred G male 33.0 0 0 113790 26.5500 NaN S NaN 109.0 London
[4]:
y_train[0:10]
[4]:
array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0], dtype=int64)
[13]:
from aikit.pipeline import GraphPipeline
from aikit.transformers import ColumnsSelector, NumericalEncoder, NumImputer, CountVectorizerWrapper
from sklearn.ensemble import RandomForestClassifier

text_cols     = ["name","ticket"]
non_text_cols = [c for c in Xtrain.columns if c not in text_cols]

gpipeline = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect","rf")])

gpipeline.fit(Xtrain,y_train)
gpipeline.graphviz
[13]:
_images/notebooks_GettingStarted_3_0.svg
[17]:
from aikit.cross_validation import cross_validation
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(10, shuffle=True, random_state=123)

cv_res, yhat_proba = cross_validation(gpipeline, Xtrain, y_train,cv=cv, scoring=["accuracy", "roc_auc", "neg_log_loss"], return_predict=True, method="predict_proba")

cv_res
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   13.3s finished
[17]:
test_accuracy test_roc_auc test_neg_log_loss train_accuracy train_roc_auc train_neg_log_loss fit_time score_time n_test_samples fold_nb
0 0.980952 0.998095 -0.116852 1.0 1.0 -0.043928 0.841272 0.133642 105 0
1 0.961905 0.969512 -0.484764 1.0 1.0 -0.042055 0.903584 0.126663 105 1
2 0.961905 0.994474 -0.139215 1.0 1.0 -0.041864 0.747745 0.117702 105 2
3 0.990476 1.000000 -0.102579 1.0 1.0 -0.042316 0.735497 0.120638 105 3
4 0.952381 0.994284 -0.130034 1.0 1.0 -0.044032 0.772531 0.119271 105 4
5 0.961905 0.996570 -0.134116 1.0 1.0 -0.041499 0.725558 0.126695 105 5
6 0.971429 0.998476 -0.140661 1.0 1.0 -0.047236 0.761099 0.116698 105 6
7 0.961905 0.995617 -0.155353 1.0 1.0 -0.040947 0.734543 0.111288 105 7
8 0.961538 0.994386 -0.132630 1.0 1.0 -0.041335 0.740471 0.111782 104 8
9 0.980769 0.996903 -0.150282 1.0 1.0 -0.044830 0.749113 0.112777 104 9
[18]:
yhat_proba.head(10)
[18]:
0 1
0 1.00 0.00
1 0.88 0.12
2 0.05 0.95
3 0.93 0.07
4 0.07 0.93
5 1.00 0.00
6 1.00 0.00
7 0.03 0.97
8 0.06 0.94
9 0.98 0.02

Using cross_validation you get, in one call :

  • both the train and test scores
  • all the requested metrics
  • the predicted probabilities for each observation
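For instance, the per-fold scores in the cv_res DataFrame shown above can be aggregated in one line (a minimal illustration):

# average out-of-fold performance over the 10 folds
cv_res.loc[:, ["test_accuracy", "test_roc_auc", "test_neg_log_loss"]].mean()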

Examples

GraphPipeline getting started

This notebook is here to show a few things that can be done by the package.

It doesn’t mean that these are the things you should do on this particular dataset.

Let’s load the titanic dataset to test a few things

[1]:
import warnings
warnings.filterwarnings('ignore') # to remove gensim warning
[2]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)
[3]:
Xtrain.head(20)
[3]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
5 3 Waelens, Mr. Achille male 22.0 0 0 345767 9.0000 NaN S NaN NaN Antwerp, Belgium / Stanton, OH
6 3 Reed, Mr. James George male NaN 0 0 362316 7.2500 NaN S NaN NaN NaN
7 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S 8 NaN Brooklyn, NY
8 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 1 0 13695 60.0000 C31 S 6 NaN Huntington, WV
9 1 Rowe, Mr. Alfred G male 33.0 0 0 113790 26.5500 NaN S NaN 109.0 London
10 3 Meo, Mr. Alfonzo male 55.5 0 0 A.5. 11206 8.0500 NaN S NaN 201.0 NaN
11 3 Abbott, Mr. Rossmore Edward male 16.0 1 1 C.A. 2673 20.2500 NaN S NaN 190.0 East Providence, RI
12 3 Elias, Mr. Dibo male NaN 0 0 2674 7.2250 NaN C NaN NaN NaN
13 2 Reynaldo, Ms. Encarnacion female 28.0 0 0 230434 13.0000 NaN S 9 NaN Spain
14 3 Khalil, Mr. Betros male NaN 1 0 2660 14.4542 NaN C NaN NaN NaN
15 1 Daniels, Miss. Sarah female 33.0 0 0 113781 151.5500 NaN S 8 NaN NaN
16 3 Ford, Miss. Robina Maggie 'Ruby' female 9.0 2 2 W./C. 6608 34.3750 NaN S NaN NaN Rotherfield, Sussex, England Essex Co, MA
17 3 Thorneycroft, Mrs. Percival (Florence Kate White) female NaN 1 0 376564 16.1000 NaN S 10 NaN NaN
18 3 Lennon, Mr. Denis male NaN 1 0 370371 15.5000 NaN Q NaN NaN NaN
19 3 de Pelsmaeker, Mr. Alfons male 16.0 0 0 345778 9.5000 NaN S NaN NaN NaN
[4]:
y_train[0:20]
[4]:
array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
      dtype=int64)

For now let’s ignore the name and ticket columns, which should probably be handled as text

[5]:
import pandas as pd
from aikit.transformers import TruncatedSVDWrapper, NumImputer, CountVectorizerWrapper, NumericalEncoder
from aikit.pipeline import GraphPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
Matplotlib won't work
[6]:
non_text_cols = [c for c in Xtrain.columns if c not in ("ticket","name")] # everything that is not text
non_text_cols
[6]:
['pclass',
 'sex',
 'age',
 'sibsp',
 'parch',
 'fare',
 'cabin',
 'embarked',
 'boat',
 'body',
 'home_dest']
[7]:
gpipeline = GraphPipeline(models = { "enc":NumericalEncoder(),
                                     "imp":NumImputer(),
                                     "forest":RandomForestClassifier(n_estimators=100)
                                   },
                          edges = [("enc","imp","forest")])

gpipeline.fit(Xtrain.loc[:,non_text_cols],y_train)
gpipeline.graphviz
[7]:
_images/notebooks_GraphPipeline_9_0.svg

Let’s do a cross-validation

[8]:
from aikit.cross_validation import cross_validation
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(10, random_state=123, shuffle=True)

cv_result = cross_validation(gpipeline, Xtrain.loc[:,non_text_cols], y_train,cv = cv,
                             scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    3.8s finished
[8]:
test_roc_auc test_accuracy test_neg_log_loss train_roc_auc train_accuracy train_neg_log_loss fit_time score_time n_test_samples fold_nb
0 0.997332 0.990476 -0.050391 0.999830 0.995758 -0.029559 0.200567 0.076803 105 0
1 0.968369 0.961905 -0.723250 0.999986 0.997879 -0.022651 0.192597 0.066856 105 1
2 0.983232 0.942857 -0.154483 0.999816 0.995758 -0.026256 0.200526 0.069852 105 2
3 1.000000 1.000000 -0.035742 0.999707 0.995758 -0.030825 0.210022 0.070714 105 3
4 0.996380 0.961905 -0.088300 0.999802 0.995758 -0.028642 0.199074 0.064868 105 4
5 0.991806 0.952381 -0.125793 0.999797 0.997879 -0.025816 0.193644 0.070361 105 5
6 1.000000 1.000000 -0.040940 0.999703 0.995758 -0.029609 0.215009 0.065831 105 6
7 0.996380 0.980952 -0.088508 0.999842 0.996819 -0.026614 0.184709 0.077791 105 7
8 0.992838 0.971154 -0.107017 0.999793 0.995763 -0.027610 0.187533 0.063796 104 8
9 0.999613 0.980769 -0.072026 0.999764 0.995763 -0.028887 0.199505 0.062847 104 9

This cross-validates the complete pipeline. The differences with the sklearn function are that :

  • you can score more than one metric at a time
  • you retrieve both the train and test scores

[9]:
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[9]:
test_roc_auc         0.992595
test_accuracy        0.974240
test_neg_log_loss   -0.148645
dtype: float64

We can do the same thing but select the columns directly within the pipeline :

[10]:
from aikit.transformers import ColumnsSelector
gpipeline2 = GraphPipeline(models = { "sel":ColumnsSelector(columns_to_use=non_text_cols),
                                      "enc":NumericalEncoder(columns_to_use="object"),
                                      "imp":NumImputer(),
                                      "forest":RandomForestClassifier(n_estimators=100, random_state=123)
                                    },
                         edges = [("sel","enc","imp","forest")])

gpipeline2.fit(Xtrain,y_train)
gpipeline2.graphviz
[10]:
_images/notebooks_GraphPipeline_15_0.svg

Remark : ‘columns_to_use=”object”’ tells aikit to encode the columns of type object; the other columns are kept untouched
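Based on the dtypes of this particular dataset, passing the explicit list of object columns should be equivalent (a hypothetical illustration; the variable name and column list are assumptions):

# hypothetical equivalent of columns_to_use="object" for the titanic data:
# list the string/object columns explicitly
enc = NumericalEncoder(columns_to_use=["sex", "cabin", "embarked", "boat", "home_dest"])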

[11]:
cv_result = cross_validation(gpipeline2,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    4.0s finished
[11]:
test_roc_auc         0.991698
test_accuracy        0.972335
test_neg_log_loss   -0.178280
dtype: float64

Now let’s see what we can do with the columns we excluded. We could craft features from them, but let’s try to use them as text directly.

[12]:
text_cols = ["ticket","name"]
vect = CountVectorizerWrapper(analyzer="word", columns_to_use=text_cols)
vect.fit(Xtrain,y_train)
[12]:
CountVectorizerWrapper(analyzer='word', column_prefix='BAG',
                       columns_to_use=['ticket', 'name'],
                       desired_output_type='SparseArray',
                       drop_unused_columns=True, drop_used_columns=True,
                       max_df=1.0, max_features=None, min_df=1, ngram_range=1,
                       regex_match=False, tfidf=False, vocabulary=None)

Remark : the aikit CountVectorizer can directly work on 2 (or more) columns; there is no need to use a FeatureUnion or anything of the sort
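For comparison, a rough plain-scikit-learn equivalent would need one CountVectorizer per text column, glued together with a ColumnTransformer (a sketch only, not aikit code):

# plain sklearn sketch: one vectorizer per text column, combined by a ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

sk_vect = ColumnTransformer([
    ("name_bow", CountVectorizer(analyzer="word"), "name"),      # 1D column -> its own vectorizer
    ("ticket_bow", CountVectorizer(analyzer="word"), "ticket"),
])
sk_vect.fit_transform(Xtrain)  # sparse matrix, similar in spirit to the aikit output below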

[13]:
features = vect.get_feature_names()
features[0:20] + ["..."] + features[-20:]
[13]:
['ticket__BAG__10482',
 'ticket__BAG__110152',
 'ticket__BAG__110413',
 'ticket__BAG__110465',
 'ticket__BAG__110469',
 'ticket__BAG__110489',
 'ticket__BAG__110564',
 'ticket__BAG__110813',
 'ticket__BAG__111163',
 'ticket__BAG__111240',
 'ticket__BAG__111320',
 'ticket__BAG__111361',
 'ticket__BAG__111369',
 'ticket__BAG__111426',
 'ticket__BAG__111427',
 'ticket__BAG__112050',
 'ticket__BAG__112052',
 'ticket__BAG__112053',
 'ticket__BAG__112058',
 'ticket__BAG__11206',
 '...',
 'name__BAG__woolf',
 'name__BAG__woolner',
 'name__BAG__worth',
 'name__BAG__wright',
 'name__BAG__wyckoff',
 'name__BAG__yarred',
 'name__BAG__yasbeck',
 'name__BAG__ylio',
 'name__BAG__yoto',
 'name__BAG__young',
 'name__BAG__youseff',
 'name__BAG__yousif',
 'name__BAG__youssef',
 'name__BAG__yousseff',
 'name__BAG__yrois',
 'name__BAG__zabour',
 'name__BAG__zakarian',
 'name__BAG__zebley',
 'name__BAG__zenni',
 'name__BAG__zillah']

The vectorizer directly encodes the 2 text columns

[14]:
xx_res = vect.transform(Xtrain)
xx_res
[14]:
<1048x2440 sparse matrix of type '<class 'numpy.int32'>'
        with 5414 stored elements in COOrdinate format>
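If you need a dense DataFrame with named columns (only reasonable for small data like this one), you can combine the sparse output with the feature names retrieved above (a minimal sketch; xx_df is an illustrative name):

import pandas as pd

# densify the sparse matrix and attach the feature names from the vectorizer
xx_df = pd.DataFrame(xx_res.toarray(), columns=vect.get_feature_names())
xx_df.head()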

Again let’s create a GraphPipeline to cross-validate

[15]:
gpipeline3 = GraphPipeline(models = {"vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
                                     "logit":LogisticRegression(solver="liblinear", random_state=123)},
                           edges=[("vect","logit")])
gpipeline3.fit(Xtrain,y_train)
gpipeline3.graphviz
[15]:
_images/notebooks_GraphPipeline_25_0.svg
[16]:
cv_result = cross_validation(gpipeline3, Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.9s finished
[16]:
test_roc_auc         0.850918
test_accuracy        0.819679
test_neg_log_loss   -0.451681
dtype: float64

We can also try a “bag of characters”

[17]:
gpipeline4 = GraphPipeline(models = {
        "vect": CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
        "logit": LogisticRegression(solver="liblinear", random_state=123) }, edges=[("vect","logit")])
gpipeline4.fit(Xtrain,y_train)
gpipeline4.graphviz
[17]:
_images/notebooks_GraphPipeline_28_0.svg
[18]:
cv_result = cross_validation(gpipeline4,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    5.9s finished
[18]:
test_roc_auc         0.849773
test_accuracy        0.813956
test_neg_log_loss   -0.559254
dtype: float64

Now let’s use all the columns

[19]:
gpipeline5 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect","rf")])
gpipeline5.fit(Xtrain,y_train)
gpipeline5.graphviz
[19]:
_images/notebooks_GraphPipeline_31_0.svg

This model uses both sets of columns:

  • the bag-of-words features
  • the categorical/numerical features

[20]:
cv_result = cross_validation(gpipeline5,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   11.0s finished
[20]:
test_roc_auc         0.992779
test_accuracy        0.968507
test_neg_log_loss   -0.173236
dtype: float64

We can also use both a Bag of Characters and a Bag of Words

[21]:
gpipeline6 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_char":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_word":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect_char","rf"),("vect_word","rf")])
gpipeline6.fit(Xtrain,y_train)
gpipeline6.graphviz
[21]:
_images/notebooks_GraphPipeline_35_0.svg
[22]:
cv_result = cross_validation(gpipeline6,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   13.9s finished
[22]:
test_roc_auc         0.947360
test_accuracy        0.843516
test_neg_log_loss   -0.325666
dtype: float64

Maybe we can try an SVD to limit the dimension of the bag-of-characters/words features

[23]:
gpipeline7 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "svd":TruncatedSVDWrapper(n_components=100, random_state=123),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel", "enc","imp","rf"),("vect_word","svd","rf"),("vect_char","svd","rf")])
gpipeline7.fit(Xtrain,y_train)
gpipeline7.graphviz
[23]:
_images/notebooks_GraphPipeline_38_0.svg
[24]:
cv_result = cross_validation(gpipeline7,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   23.4s finished
[24]:
test_roc_auc         0.992953
test_accuracy        0.972326
test_neg_log_loss   -0.167037
dtype: float64

We can even add the ‘SVD’ columns AND the bag-of-words/characters columns

[25]:
gpipeline8 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "svd":TruncatedSVDWrapper(n_components=100, random_state=123),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
            edges = [("sel","enc","imp","rf"),("vect_word","svd","rf"),("vect_char","svd","rf"),("vect_word","rf"),("vect_char","rf")])

gpipeline8.graphviz
[25]:
_images/notebooks_GraphPipeline_41_0.svg
[26]:
cv_result = cross_validation(gpipeline8,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   22.1s finished
[26]:
test_roc_auc         0.941329
test_accuracy        0.834011
test_neg_log_loss   -0.334545
dtype: float64

Instead of the ‘SVD’ we can add a layer that filters columns…

[27]:
from aikit.transformers import FeaturesSelectorClassifier
[28]:
gpipeline9 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "selector":FeaturesSelectorClassifier(n_components=20),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect_word","selector","rf"),("vect_char","selector","rf")])

gpipeline9.graphviz
[28]:
_images/notebooks_GraphPipeline_45_0.svg
Retrieve feature importance

Let’s use that complicated example to show how to retrieve the feature importances

[29]:
gpipeline9.fit(Xtrain, y_train)

df_imp = pd.Series(gpipeline9.models["rf"].feature_importances_,
                  index = gpipeline9.get_input_features_at_node("rf"))
df_imp.sort_values(ascending=False,inplace=True)
df_imp
[29]:
boat____null__             3.839758e-01
sex__female                3.816301e-02
name__BAG__mr              3.715979e-02
name__BAG__mr.             3.636483e-02
fare                       3.419880e-02
name__BAG__mr.             3.133609e-02
sex__male                  2.962421e-02
name__BAG__r.              2.910019e-02
name__BAG__s.              2.776609e-02
boat__15                   2.672268e-02
age                        2.643157e-02
name__BAG__s.              2.500470e-02
name__BAG__ mr.            2.249752e-02
boat__13                   1.863079e-02
boat____default__          1.711391e-02
pclass                     1.665125e-02
name__BAG__                1.597853e-02
sibsp                      1.524516e-02
home_dest____null__        1.015056e-02
boat__7                    9.817018e-03
home_dest____default__     9.534058e-03
boat__C                    9.453317e-03
cabin____null__            8.265959e-03
cabin____default__         7.290940e-03
parch                      7.138940e-03
embarked__S                6.643220e-03
boat__5                    6.206360e-03
name__BAG__iss.            6.139824e-03
embarked__C                6.040638e-03
boat__3                    5.547742e-03
name__BAG__(               5.352397e-03
name__BAG__mr              5.260205e-03
body_isnull                4.829877e-03
name__BAG__ (              4.360392e-03
boat__16                   4.245866e-03
boat__9                    4.224166e-03
boat__D                    4.194419e-03
name__BAG__ss              4.076246e-03
embarked__Q                4.047912e-03
name__BAG__mrs             3.602001e-03
body                       2.955222e-03
name__BAG__rs              2.899086e-03
name__BAG__rs.             2.869114e-03
age_isnull                 2.859144e-03
boat__14                   2.809765e-03
boat__10                   2.695927e-03
name__BAG__rs.             2.165917e-03
boat__12                   2.103210e-03
name__BAG__mrs.            2.064884e-03
home_dest__New York, NY    1.799501e-03
boat__11                   1.495248e-03
name__BAG__miss            1.054318e-03
name__BAG__mrs             9.616334e-04
boat__4                    9.420111e-04
boat__6                    8.184419e-04
home_dest__London          7.515602e-04
boat__8                    3.679950e-04
fare_isnull                2.438310e-09
dtype: float64
[30]:
cv_result = cross_validation(gpipeline9,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   15.0s finished
[30]:
test_roc_auc         0.994108
test_accuracy        0.973288
test_neg_log_loss   -0.153255
dtype: float64
[31]:
gpipeline10 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "svd":TruncatedSVDWrapper(n_components=10),
    "selector":FeaturesSelectorClassifier(n_components=10, random_state=123),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),
                       ("vect_word","selector","rf"),
                       ("vect_char","selector","rf"),
                       ("vect_word","svd","rf"),
                       ("vect_char","svd","rf")])

gpipeline10.fit(Xtrain,y_train)
gpipeline10.graphviz
[31]:
_images/notebooks_GraphPipeline_49_0.svg

In this model, here is what is done:

  • categorical columns are encoded (‘enc’)
  • missing values are filled (‘imp’)
  • bag-of-words and bag-of-characters features are created for the two text columns
  • an SVD is applied to those
  • a selector keeps the most important bag-of-words/characters features
  • everything is given to a RandomForest

[32]:
cv_result = cross_validation(gpipeline10,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   16.9s finished
[32]:
test_roc_auc         0.994333
test_accuracy        0.975201
test_neg_log_loss   -0.143788
dtype: float64

As we saw, the GraphPipeline allows flexibility in the creation of models, and several choices can easily be tested.

Again, these are not the best possible choices for this dataset; the examples are here to illustrate the capabilities.

Better scores could be obtained by adjusting the hyper-parameters and/or the models/transformers and by creating some new features.

[ ]:

How to load a model from a json

This notebook shows how to save a model definition into a json object and reload it to be trained

[1]:
import warnings
warnings.filterwarnings('ignore') # to remove gensim warning
[2]:
from aikit.model_definition import sklearn_model_from_param
Matplotlib won't work

The idea is to be able to define a model by its name and its parameters. The overall syntax is :

(ModelName , {hyperparameters})

Example : this is a RandomForestClassifier

[3]:
rf_json = ("RandomForestClassifier",{"n_estimators":100})
rf_json
[3]:
('RandomForestClassifier', {'n_estimators': 100})

… which you can create using ‘sklearn_model_from_param’

[4]:
rf = sklearn_model_from_param(rf_json)
rf
[4]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

The idea is simple :

  • an sklearn model created by klass(**kwargs)
  • corresponds to the 2-tuple : (‘klass’, kwargs)
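To make the mechanism concrete, here is a toy version of the conversion (a sketch only; aikit’s real sklearn_model_from_param also handles nested models and relies on its own register, DICO_NAME_KLASS, shown later):

# toy registry and converter, just to illustrate the ('Name', kwargs) -> Name(**kwargs) idea
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

_MINI_REGISTER = {
    "RandomForestClassifier": RandomForestClassifier,
    "LogisticRegression": LogisticRegression,
}

def mini_model_from_param(param):
    name, kwargs = param
    return _MINI_REGISTER[name](**kwargs)

mini_model_from_param(("RandomForestClassifier", {"n_estimators": 100}))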

You can create more complex models, like a GraphPipeline

[5]:
json_enc = ("NumericalEncoder",{"columns_to_use": ["BLOCK" + "NUMBERTOKEN","DATETOKEN","CURRENCYTOKEN"]})
json_vec = ("CountVectorizerWrapper",{"analyzer":"char","ngram_range":(1,4),"columns_to_use":["STRINGLINE"]})
json_rf  = ("RandomForestClassifier",{"n_estimators":500})
json_des = ("GraphPipeline",{"models":{"encoder":json_enc,
                                       "vect":json_vec,
                                       "rf":json_rf},
            "edges":[("encoder","rf"),("vect","rf")]})
json_des
[5]:
('GraphPipeline',
 {'models': {'encoder': ('NumericalEncoder',
    {'columns_to_use': ['BLOCKNUMBERTOKEN', 'DATETOKEN', 'CURRENCYTOKEN']}),
   'vect': ('CountVectorizerWrapper',
    {'analyzer': 'char',
     'ngram_range': (1, 4),
     'columns_to_use': ['STRINGLINE']}),
   'rf': ('RandomForestClassifier', {'n_estimators': 500})},
  'edges': [('encoder', 'rf'), ('vect', 'rf')]})

… and again you can convert it to a real model :

[6]:
gpipe = sklearn_model_from_param(json_des)
gpipe
gpipe.graphviz
[6]:
_images/notebooks_ModelJson_12_0.svg
[7]:
gpipe.models["encoder"]
[7]:
NumericalEncoder(columns_to_use=['BLOCKNUMBERTOKEN', 'DATETOKEN',
                                 'CURRENCYTOKEN'],
                 desired_output_type='DataFrame', drop_unused_columns=False,
                 drop_used_columns=True, encoding_type='dummy',
                 max_cum_proba=0.95, max_modalities_number=100,
                 max_na_percentage=0.05, min_modalities_number=20,
                 min_nb_observations=10, regex_match=False)
[8]:
gpipe.models["rf"]
[8]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
[9]:
gpipe.models["vect"]
[9]:
CountVectorizerWrapper(analyzer='char', column_prefix='BAG',
                       columns_to_use=['STRINGLINE'],
                       desired_output_type='SparseArray',
                       drop_unused_columns=True, drop_used_columns=True,
                       max_df=1.0, max_features=None, min_df=1,
                       ngram_range=(1, 4), regex_match=False, tfidf=False,
                       vocabulary=None)

If a model has another model as one of its parameters, like a ‘BoxCoxTargetTransformer’ or a ‘Stacker’… it works the same way

[10]:
json_full = "BoxCoxTargetTransformer",{"model":json_des,"ll":0}
json_full
[10]:
('BoxCoxTargetTransformer',
 {'model': ('GraphPipeline',
   {'models': {'encoder': ('NumericalEncoder',
      {'columns_to_use': ['BLOCKNUMBERTOKEN', 'DATETOKEN', 'CURRENCYTOKEN']}),
     'vect': ('CountVectorizerWrapper',
      {'analyzer': 'char',
       'ngram_range': (1, 4),
       'columns_to_use': ['STRINGLINE']}),
     'rf': ('RandomForestClassifier', {'n_estimators': 500})},
    'edges': [('encoder', 'rf'), ('vect', 'rf')]}),
  'll': 0})
[11]:
model = sklearn_model_from_param(json_full)
model
[11]:
BoxCoxTargetTransformer(ll=0,
                        model=GraphPipeline(edges=[('encoder', 'rf'),
                                                   ('vect', 'rf')],
                                            models={'encoder': NumericalEncoder(columns_to_use=['BLOCKNUMBERTOKEN',
                                                                                                'DATETOKEN',
                                                                                                'CURRENCYTOKEN'],
                                                                                desired_output_type='DataFrame',
                                                                                drop_unused_columns=False,
                                                                                drop_used_columns=True,
                                                                                encoding_type='dummy',
                                                                                max_cum_proba=0.95,
                                                                                max_modalities_number=100,
                                                                                max_na_percenta...
                                                                                 random_state=None,
                                                                                 verbose=0,
                                                                                 warm_start=False),
                                                    'vect': CountVectorizerWrapper(analyzer='char',
                                                                                   column_prefix='BAG',
                                                                                   columns_to_use=['STRINGLINE'],
                                                                                   desired_output_type='SparseArray',
                                                                                   drop_unused_columns=True,
                                                                                   drop_used_columns=True,
                                                                                   max_df=1.0,
                                                                                   max_features=None,
                                                                                   min_df=1,
                                                                                   ngram_range=(1,
                                                                                                4),
                                                                                   regex_match=False,
                                                                                   tfidf=False,
                                                                                   vocabulary=None)},
                                            no_concat_nodes=None,
                                            verbose=False))
[12]:
model.model.graphviz
[12]:
_images/notebooks_ModelJson_19_0.svg
[13]:
from aikit.model_definition import DICO_NAME_KLASS

For it to work, the model should be added within DICO_NAME_KLASS

[14]:
DICO_NAME_KLASS
[14]:
registered klasses :
AgglomerativeClusteringWrapper
BoxCoxTargetTransformer
CdfScaler
Char2VecVectorizer
ColumnsSelector
CountVectorizerWrapper
DBSCANWrapper
ExtraTreesClassifier
ExtraTreesRegressor
FeaturesSelectorClassifier
FeaturesSelectorRegressor
GraphPipeline
KMeansTransformer
KMeansWrapper
LGBMClassifier
LGBMRegressor
Lasso
LogisticRegression
NumImputer
NumericalEncoder
OutSamplerTransformer
PCAWrapper
PassThrough
Pipeline
RandomForestClassifier
RandomForestRegressor
Ridge
StackerClassifier
StackerRegressor
TargetEncoderClassifier
TargetEncoderEntropyClassifier
TargetEncoderRegressor
TextDefaultProcessing
TextDigitAnonymizer
TextNltkProcessing
TruncatedSVDWrapper
Word2VecVectorizer

For the conversion to work, each model needs to be registered there

Auto ML

This notebook will explain the auto-ml capabilities of aikit.

It shows the several components involved. If you just want to run it, you should use the automl launcher

Let’s start by loading some small data

[1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

from aikit.datasets.datasets import load_dataset,DatasetEnum
dfX, y, _ ,_ , _ = load_dataset(DatasetEnum.titanic)
dfX.head()
[1]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
[2]:
y[0:5]
[2]:
array([0, 0, 1, 0, 1])

Now let’s load what is needed

[4]:
from aikit.ml_machine import AutoMlConfig, JobConfig,  MlJobManager, MlJobRunner, AutoMlResultReader
from aikit.ml_machine import FolderDataPersister, SavingType, AutoMlModelGuider

AutoML configuration object

This object will contain all the relevant information about the problem at hand:

  • its type : REGRESSION or CLASSIFICATION
  • the information about the columns in the data
  • the steps that are needed in the processing pipeline (see explanation below)
  • the models that are to be tested
  • …

By default everything will be guessed, but it can all be changed if needed

[5]:
auto_ml_config = AutoMlConfig(dfX = dfX, y = y, name = "titanic")
auto_ml_config.guess_everything()
auto_ml_config
[5]:
<aikit.ml_machine.ml_machine.AutoMlConfig object at 0x10c90fa90>
type of problem : CLASSIFICATION
type of problem
[6]:
auto_ml_config.type_of_problem
[6]:
'CLASSIFICATION'

The config guessed that this is a classification problem

information about columns
[7]:
auto_ml_config.columns_informations
[7]:
OrderedDict([('pclass',
              {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
             ('name',
              {'TypeOfVariable': 'TEXT', 'HasMissing': False, 'ToKeep': True}),
             ('sex',
              {'TypeOfVariable': 'CAT', 'HasMissing': False, 'ToKeep': True}),
             ('age',
              {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
             ('sibsp',
              {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
             ('parch',
              {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
             ('ticket',
              {'TypeOfVariable': 'TEXT', 'HasMissing': False, 'ToKeep': True}),
             ('fare',
              {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
             ('cabin',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
             ('embarked',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
             ('boat',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
             ('body',
              {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
             ('home_dest',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True})])
[8]:
pd.DataFrame(auto_ml_config.columns_informations).T
[8]:
HasMissing ToKeep TypeOfVariable
pclass False True NUM
name False True TEXT
sex False True CAT
age True True NUM
sibsp False True NUM
parch False True NUM
ticket False True TEXT
fare True True NUM
cabin True True CAT
embarked True True CAT
boat True True CAT
body True True NUM
home_dest True True CAT

For each column in the DataFrame, its type was guessed among three possible values:

  • NUM : for numerical columns
  • TEXT : for columns that contain text
  • CAT : for categorical columns

Remarks:

  • The difference between TEXT and CAT is based on the number of distinct modalities
  • Be careful with categorical values that are encoded as integers (the algorithm won’t know that the feature is really categorical)
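For example, ‘pclass’ is stored as an integer and was therefore guessed as NUM. If it should be treated as a category, one simple option (an assumption for illustration, not an aikit-specific API; the variable names are hypothetical) is to convert it to strings before building the config:

# hypothetical fix: make the integer-coded category a string column so that the
# type guess treats it as categorical rather than numerical
dfX_cat = dfX.copy()
dfX_cat["pclass"] = dfX_cat["pclass"].astype(str)

auto_ml_config_cat = AutoMlConfig(dfX=dfX_cat, y=y, name="titanic_cat")
auto_ml_config_cat.guess_everything()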

columns block
[9]:
auto_ml_config.columns_block
[9]:
OrderedDict([('NUM', ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']),
             ('TEXT', ['name', 'ticket']),
             ('CAT', ['sex', 'cabin', 'embarked', 'boat', 'home_dest'])])

The ml machine has the notion of blocks of columns. For some use cases, features naturally fall into blocks. By default the tool will use the types of the features as blocks, but other groupings can be used.

The ml machine will sometimes try to create a model without one of the blocks

needed steps
[10]:
auto_ml_config.needed_steps
[10]:
[{'step': 'TextPreprocessing', 'optional': True},
 {'step': 'TextEncoder', 'optional': False},
 {'step': 'TextDimensionReduction', 'optional': True},
 {'step': 'CategoryEncoder', 'optional': False},
 {'step': 'MissingValueImputer', 'optional': False},
 {'step': 'Scaling', 'optional': True},
 {'step': 'DimensionReduction', 'optional': True},
 {'step': 'FeatureExtraction', 'optional': True},
 {'step': 'FeatureSelection', 'optional': True},
 {'step': 'Model', 'optional': False}]

The ml machine will create processing pipelines by assembling different steps. Here are the steps it will use for this use case :

  • TextPreprocessing
  • TextEncoder : encoding of text into numerical values
  • TextDimensionReduction : specific dimension reduction for text based features
  • CategoryEncoder : encoder of categorical data
  • MissingValueImputer : since there are missing values, they need to be filled
  • Scaling : step to re-scale features
  • DimensionReduction : generic dimension reduction
  • FeatureExtraction : create new features
  • FeatureSelection : feature selection
  • Model : the final classification/regression model
models to keep
[11]:
auto_ml_config.models_to_keep
[11]:
[('Model', 'LogisticRegression'),
 ('Model', 'RandomForestClassifier'),
 ('Model', 'ExtraTreesClassifier'),
 ('Model', 'LGBMClassifier'),
 ('FeatureSelection', 'FeaturesSelectorClassifier'),
 ('TextEncoder', 'CountVectorizerWrapper'),
 ('TextEncoder', 'Word2VecVectorizer'),
 ('TextEncoder', 'Char2VecVectorizer'),
 ('TextPreprocessing', 'TextNltkProcessing'),
 ('TextPreprocessing', 'TextDefaultProcessing'),
 ('TextPreprocessing', 'TextDigitAnonymizer'),
 ('CategoryEncoder', 'NumericalEncoder'),
 ('CategoryEncoder', 'TargetEncoderClassifier'),
 ('MissingValueImputer', 'NumImputer'),
 ('DimensionReduction', 'TruncatedSVDWrapper'),
 ('DimensionReduction', 'PCAWrapper'),
 ('TextDimensionReduction', 'TruncatedSVDWrapper'),
 ('DimensionReduction', 'KMeansTransformer'),
 ('Scaling', 'CdfScaler')]

This gives us the list of models/transformers to test at each step.

Remark: some steps are removed because they have no transformer yet

job configuration

[12]:
job_config = JobConfig()
job_config.guess_cv(auto_ml_config = auto_ml_config, n_splits = 10)
job_config.guess_scoring(auto_ml_config = auto_ml_config)
job_config.score_base_line = None
[13]:
job_config.scoring
[13]:
['accuracy', 'log_loss_patched', 'avg_roc_auc', 'f1_macro']
[14]:
job_config.cv
[14]:
StratifiedKFold(n_splits=10, random_state=123, shuffle=True)
[15]:
job_config.main_scorer
[15]:
'accuracy'
[16]:
job_config.score_base_line

The baseline can be set if we know what a good performance is. It is used as the threshold below which cross-validation is stopped after the first fold
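For instance, if we already knew that roughly 95% accuracy (the main scorer here) is reachable on this problem, the baseline could be set like this (the value is purely illustrative):

# illustrative baseline: models clearly below it are abandoned after the first fold
job_config.score_base_line = 0.95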

This object holds the configuration specific to the job at hand:

  • how to cross-validate
  • what scoring/benchmark to use

Data Persister

To synchronize the processes and to save values, we need an object to take care of that.

This object is a DataPersister, which saves everything on disk (other persisters using a database might be created)

[ ]:
base_folder = # INSERT PATH HERE
data_persister = FolderDataPersister(base_folder = base_folder)

controller

[ ]:
result_reader = AutoMlResultReader(data_persister)
auto_ml_guider = AutoMlModelGuider(result_reader = result_reader,
                                       job_config = job_config,
                                       metric_transformation="default",
                                       avg_metric=True
                                       )

job_controller = MlJobManager(auto_ml_config = auto_ml_config,
                                job_config = job_config,
                                auto_ml_guider = auto_ml_guider,
                                data_persister = data_persister)

The search will be driven by a controller process. This process won’t actually train models; it will decide which models should be tried.

Here three objects are actually created:

  • result_reader : its job is to read the results of the auto-ml process and aggregate them
  • auto_ml_guider : its job is to help the controller guide the search (using a Bayesian technique)
  • job_controller : the controller itself

All those objects need the ‘data_persister’ object to write/read data

Now the controller can be started using:

job_controller.run()

You need to launch it in a subprocess (a sketch launching both the controller and a worker is given after the Worker(s) section).

Worker(s)

The last thing needed is to create worker(s) that will do the actual cross-validation. Those workers will:

  • listen to the controller
  • cross-validate the models they are told to
  • save the results

[ ]:
job_runner = MlJobRunner(dfX = dfX ,
                       y = y,
                       groups = None,
                       auto_ml_config = auto_ml_config,
                       job_config = job_config,
                       data_persister = data_persister)

As before, the worker can be started using : job_runner.run()

You need to launch that in a subprocess or a thread; see the sketch below.
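Here is one simple way to launch both without blocking the notebook. This is only a sketch, assuming the job_controller and job_runner objects created above; the automl launcher mentioned at the beginning handles this for you.

import threading

# run the controller and one worker in background threads so the notebook is not blocked
# (daemon=True so they stop when the notebook process exits)
controller_thread = threading.Thread(target=job_controller.run, daemon=True)
runner_thread = threading.Thread(target=job_runner.run, daemon=True)

controller_thread.start()
runner_thread.start()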

Result Reader

After a few models have been tested you can look at the results; for that you need the ‘result_reader’ (which is re-created here for simplicity)

[ ]:
base_folder = # INSERT path here
data_persister = FolderDataPersister(base_folder = base_folder)

result_reader = AutoMlResultReader(data_persister)

[ ]:
df_results = result_reader.load_all_results()
df_params  = result_reader.load_all_params()
df_errors  = result_reader.load_all_errors()
  • df_results : DataFrame with the scoring results
  • df_params : DataFrame with the parameters of the complete processing pipeline
  • df_errors : DataFrame with the errors

All those DataFrames can be joined using the common ‘job_id’ column

[ ]:
df_merged_result = pd.merge( df_params, df_results, how = "inner",on = "job_id")
df_merged_error  = pd.merge( df_params, df_errors , how = "inner",on = "job_id")

And the results can be written to an Excel file (for example)

[ ]:
try:
    df_merged_result.to_excel(base_folder + "/result.xlsx",index=False)
except OSError:
    print("I couldn't save excel file")

try:
    df_merged_error.to_excel(base_folder + "/result_error.xlsx",index=False)
except OSError:
    print("I couldn't save excel file")

Load a given model

[ ]:
from aikit.ml_machine import FolderDataPersister, SavingType
from aikit.model_definition import sklearn_model_from_param

base_folder = # INSERT path here
data_persister = FolderDataPersister(base_folder = base_folder)

[ ]:
job_id    = # INSERT job_id here
job_param = data_persister.read(job_id, path = "job_param", write_type = SavingType.json)
job_param
[ ]:
model = sklearn_model_from_param(job_param["model_json"])
model
[ ]:

Default Pipeline

This notebook shows how you can use aikit to directly get a default pipeline that you can fit on your data

[1]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)

[2]:
from aikit.ml_machine import get_default_pipeline
model = get_default_pipeline(Xtrain, y_train)
model
Matplotlib won't work
C:\HOMEWARE\Anaconda3-Windows-x86_64\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
[2]:
GraphPipeline(edges=[('ColumnsSelector', 'NumImputer'),
                     ('CountVectorizerWrapper', 'NumImputer'),
                     ('NumericalEncoder', 'NumImputer',
                      'RandomForestClassifier')],
              models={'ColumnsSelector': ColumnsSelector(columns_to_drop=None,
                                                         columns_to_use=['pclass',
                                                                         'age',
                                                                         'sibsp',
                                                                         'parch',
                                                                         'fare',
                                                                         'body'],
                                                         raise_if_shape_differs=True,
                                                         regex_match=False),
                      'CountVectorizerWrapper'...
                      'RandomForestClassifier': RandomForestClassifier(bootstrap=True,
                                                                       class_weight=None,
                                                                       criterion='gini',
                                                                       max_depth=None,
                                                                       max_features='auto',
                                                                       max_leaf_nodes=None,
                                                                       min_impurity_decrease=0.0,
                                                                       min_impurity_split=None,
                                                                       min_samples_leaf=1,
                                                                       min_samples_split=2,
                                                                       min_weight_fraction_leaf=0.0,
                                                                       n_estimators=100,
                                                                       n_jobs=None,
                                                                       oob_score=False,
                                                                       random_state=123,
                                                                       verbose=0,
                                                                       warm_start=False)},
              no_concat_nodes=None, verbose=False)
[3]:
model.graphviz
[3]:
_images/notebooks_DefaultPipeline_3_0.svg
[4]:
model.fit(Xtrain, y_train)
[4]:
GraphPipeline(edges=[('ColumnsSelector', 'NumImputer'),
                     ('CountVectorizerWrapper', 'NumImputer'),
                     ('NumericalEncoder', 'NumImputer',
                      'RandomForestClassifier')],
              models={'ColumnsSelector': ColumnsSelector(columns_to_drop=None,
                                                         columns_to_use=['pclass',
                                                                         'age',
                                                                         'sibsp',
                                                                         'parch',
                                                                         'fare',
                                                                         'body'],
                                                         raise_if_shape_differs=True,
                                                         regex_match=False),
                      'CountVectorizerWrapper'...
                      'RandomForestClassifier': RandomForestClassifier(bootstrap=True,
                                                                       class_weight=None,
                                                                       criterion='gini',
                                                                       max_depth=None,
                                                                       max_features='auto',
                                                                       max_leaf_nodes=None,
                                                                       min_impurity_decrease=0.0,
                                                                       min_impurity_split=None,
                                                                       min_samples_leaf=1,
                                                                       min_samples_split=2,
                                                                       min_weight_fraction_leaf=0.0,
                                                                       n_estimators=100,
                                                                       n_jobs=None,
                                                                       oob_score=False,
                                                                       random_state=123,
                                                                       verbose=0,
                                                                       warm_start=False)},
              no_concat_nodes=None, verbose=False)
[ ]:

[1]:
%load_ext autoreload
%autoreload 2
[26]:
import warnings
warnings.filterwarnings('ignore') # to remove gensim warning

Auto clustering

This notebook will explain the auto-clustering capabilities of aikit.

It shows the several components involved. If you just want to run it, you should use the automl launcher

Custom random search

[3]:
import pandas as pd
from sklearn.datasets import load_iris
[4]:
iris = load_iris()
[5]:
X = iris.data
y = iris.target
[6]:
X = pd.DataFrame(X, columns=iris.feature_names)
X.head()
[6]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
[7]:
y = pd.DataFrame(y, columns=['label'])
[9]:
from aikit.ml_machine import AutoMlConfig, JobConfig,  MlJobManager, MlJobRunner, AutoMlResultReader
from aikit.ml_machine import FolderDataPersister, SavingType, AutoMlModelGuider

AutoML configuration object

This object will contain all the relevant information about the problem at hand:

  • its type : REGRESSION or CLASSIFICATION (or CLUSTERING, see below)
  • the information about the columns in the data
  • the steps that are needed in the processing pipeline
  • the models that are to be tested
  • …

By default everything will be guessed, but it can all be changed if needed

If y is set to None, it will guess that it is a clustering problem.

[10]:
from aikit.ml_machine.ml_machine_registration import MODEL_REGISTER
[11]:
auto_ml_config = AutoMlConfig(dfX=X, y=None, name = "iris")
auto_ml_config.guess_everything()
[11]:
<aikit.ml_machine.ml_machine.AutoMlConfig object at 0x12eb44630>
type of problem : CLUSTERING
[12]:
auto_ml_config.type_of_problem
[12]:
'CLUSTERING'
[13]:
pd.DataFrame(auto_ml_config.columns_informations).T
[13]:
HasMissing ToKeep TypeOfVariable
sepal length (cm) False True NUM
sepal width (cm) False True NUM
petal length (cm) False True NUM
petal width (cm) False True NUM
[14]:
auto_ml_config.needed_steps
[14]:
[{'step': 'Scaling', 'optional': True},
 {'step': 'DimensionReduction', 'optional': True},
 {'step': 'FeatureExtraction', 'optional': True},
 {'step': 'FeatureSelection', 'optional': True},
 {'step': 'Model', 'optional': False}]
[15]:
auto_ml_config.models_to_keep
[15]:
[('TextEncoder', 'CountVectorizerWrapper'),
 ('TextEncoder', 'Word2VecVectorizer'),
 ('TextEncoder', 'Char2VecVectorizer'),
 ('TextPreprocessing', 'TextNltkProcessing'),
 ('TextPreprocessing', 'TextDefaultProcessing'),
 ('TextPreprocessing', 'TextDigitAnonymizer'),
 ('CategoryEncoder', 'NumericalEncoder'),
 ('MissingValueImputer', 'NumImputer'),
 ('DimensionReduction', 'TruncatedSVDWrapper'),
 ('DimensionReduction', 'PCAWrapper'),
 ('TextDimensionReduction', 'TruncatedSVDWrapper'),
 ('DimensionReduction', 'KMeansTransformer'),
 ('Scaling', 'CdfScaler'),
 ('Model', 'KMeansWrapper'),
 ('Model', 'AgglomerativeClusteringWrapper'),
 ('Model', 'DBSCANWrapper')]

Manual clustering pipeline

[16]:
from aikit.pipeline import GraphPipeline
[17]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
[18]:
gpipeline = GraphPipeline(models={"scaler": StandardScaler(),
                                  "kmeans": KMeans(n_clusters=3)},
                         edges=[("scaler", "kmeans")])

gpipeline.fit(X)
[18]:
GraphPipeline(edges=[('scaler', 'kmeans')],
              models={'kmeans': KMeans(algorithm='auto', copy_x=True,
                                       init='k-means++', max_iter=300,
                                       n_clusters=3, n_init=10, n_jobs=None,
                                       precompute_distances='auto',
                                       random_state=None, tol=0.0001,
                                       verbose=0),
                      'scaler': StandardScaler(copy=True, with_mean=True,
                                               with_std=True)},
              no_concat_nodes=None, verbose=False)
[19]:
labels = gpipeline.predict(X)
[20]:
labels
[20]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1,
       1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2], dtype=int32)
[27]:
from aikit.cross_validation import score_from_params_clustering
cv_result = score_from_params_clustering(gpipeline, X,
                             y=None,
                             scoring=["silhouette", 'calinski_harabaz'],
                             verbose=1)
cv_result
[27]:
test_silhouette test_calinski_harabaz fit_time score_time
0 0.506153 505.957631 0.016606 0.002193
[28]:
cv_result = score_from_params_clustering(KMeans(n_clusters=3), X,
                             y=None,
                             scoring=["silhouette", 'calinski_harabaz'],
                             verbose=1)
cv_result
[28]:
test_silhouette test_calinski_harabaz fit_time score_time
0 0.552819 561.627757 0.017693 0.002682

Auto-ML

[ ]:
job_config = JobConfig()
job_config.guess_scoring(auto_ml_config = auto_ml_config)

job_config.score_base_line = None
[ ]:
job_config.scoring
[ ]:
base_folder = # INSERT PATH HERE
data_persister = FolderDataPersister(base_folder = base_folder)
[ ]:
result_reader = AutoMlResultReader(data_persister)
auto_ml_guider = AutoMlModelGuider(result_reader = result_reader,
                                       job_config = job_config,
                                       metric_transformation="default",
                                       avg_metric=True
                                       )

job_controller = MlJobManager(auto_ml_config = auto_ml_config,
                                job_config = job_config,
                                auto_ml_guider = auto_ml_guider,
                                data_persister = data_persister)
[ ]:
job_runner = MlJobRunner(dfX = X ,
                       y = None,
                       groups = None,
                       auto_ml_config = auto_ml_config,
                       job_config = job_config,
                       data_persister = data_persister)
[ ]:
def my_function(u):
    if u==0:
        job_controller.run()
    if u==1:
        job_runner.run()

Careful : this will start 2 daemon threads that won’t stop until you stop them:

from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(2)
results = pool.map(my_function, [0, 1])

Choice of columns

[1]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)

from aikit.transformers import NumericalEncoder

C:\HOMEWARE\Anaconda3-Windows-x86_64\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
[2]:
Xtrain
[2]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1043 2 Sobey, Mr. Samuel James Hayden male 25.0 0 0 C.A. 29178 13.0000 NaN S NaN NaN Cornwall / Houghton, MI
1044 1 Ryerson, Master. John Borie male 13.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C 4 NaN Haverford, PA / Cooperstown, NY
1045 2 Lahtinen, Rev. William male 30.0 1 1 250651 26.0000 NaN S NaN NaN Minneapolis, MN
1046 3 Drazenoic, Mr. Jozef male 33.0 0 0 349241 7.8958 NaN C NaN 51.0 Austria Niagara Falls, NY
1047 2 Hosono, Mr. Masabumi male 42.0 0 0 237798 13.0000 NaN S 10 NaN Tokyo, Japan

1048 rows × 13 columns

[3]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"])
Xencoded = encoder.fit_transform(Xtrain)
Xencoded.head()

[3]:
pclass name age sibsp parch ticket fare cabin embarked boat body sex__male sex__female home_dest____null__ home_dest__New York, NY home_dest__London home_dest____default__
0 1 McCarthy, Mr. Timothy J 54.0 0 0 17463 51.8625 E46 S NaN 175.0 1 0 0 0 0 1
1 1 Fortune, Mr. Mark 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN 1 0 0 0 0 1
2 1 Sagesser, Mlle. Emma 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN 0 1 1 0 0 0
3 3 Panula, Master. Urho Abraham 2.0 4 1 3101295 39.6875 NaN S NaN NaN 1 0 1 0 0 0
4 1 Maioni, Miss. Roberta 16.0 0 0 110152 86.5000 B79 S 8 NaN 0 1 1 0 0 0

Called like this the transformer encodes “sex” and “home_dest” and keeps the other columns untouched

[4]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"], drop_unused_columns=True)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()

(1048, 6)
[4]:
sex__male sex__female home_dest____null__ home_dest__New York, NY home_dest__London home_dest____default__
0 1 0 0 0 0 1
1 1 0 0 0 0 1
2 0 1 1 0 0 0
3 1 0 1 0 0 0
4 0 1 1 0 0 0
Called like this the transformer encodes “sex” and “home_dest” and drops the other columns
[5]:
Xtrain["home_dest"].value_counts()
[5]:
New York, NY                                47
London                                      11
Cornwall / Akron, OH                         9
Winnipeg, MB                                 7
Montreal, PQ                                 7
                                            ..
London / Birmingham                          1
Folkstone, Kent / New York, NY               1
Treherbert, Cardiff, Wales                   1
Devonport, England                           1
Buenos Aires, Argentina / New Jersey, NJ     1
Name: home_dest, Length: 333, dtype: int64
Only the most frequent modalities are kept (this can be changed)
[6]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"],
                           drop_unused_columns=True,
                           min_modalities_number=400)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()

(1048, 336)
[6]:
sex__male sex__female home_dest____null__ home_dest__New York, NY home_dest__London home_dest__Cornwall / Akron, OH home_dest__Winnipeg, MB home_dest__Montreal, PQ home_dest__Philadelphia, PA home_dest__Paris, France ... home_dest__Deer Lodge, MT home_dest__Bristol, England / New Britain, CT home_dest__Holley, NY home_dest__Bryn Mawr, PA, USA home_dest__Tokyo, Japan home_dest__Oslo, Norway Cameron, WI home_dest__Cambridge, MA home_dest__Ireland Brooklyn, NY home_dest__England home_dest__Aughnacliff, Co Longford, Ireland New York, NY
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 336 columns

If I specify min_modalities_number=400, all the modalities are kept.

The encoder only starts filtering the modalities if there are more than 400 of them.

[7]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"], drop_used_columns=False)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()

(1048, 19)
[7]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest sex__male sex__female home_dest____null__ home_dest__New York, NY home_dest__London home_dest____default__
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA 1 0 0 0 0 1
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB 1 0 0 0 0 1
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN 0 1 1 0 0 0
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN 1 0 1 0 0 0
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN 0 1 1 0 0 0
Called like this the transformer encodes “sex” and “home_dest” but also keeps them in the final result
[8]:
import numpy as np
import pandas as pd

from aikit.transformers import TruncatedSVDWrapper

X = pd.DataFrame(np.random.randn(100, 20), columns=[f"COL_{j}" for j in range(20)])

svd = TruncatedSVDWrapper(n_components=2, drop_used_columns=True)
Xencoded = svd.fit_transform(X)
Xencoded.head()
[8]:
SVD__0 SVD__1
0 2.075079 -0.858158
1 0.332307 -0.970121
2 2.279417 1.340435
3 -0.563442 0.551599
4 -1.640313 -1.569441
[9]:
svd = TruncatedSVDWrapper(n_components=2, drop_used_columns=False)
Xencoded = svd.fit_transform(X)
Xencoded.head()
[9]:
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 COL_8 COL_9 ... COL_12 COL_13 COL_14 COL_15 COL_16 COL_17 COL_18 COL_19 SVD__0 SVD__1
0 0.858982 -0.655989 -0.028417 -0.357398 0.569531 0.145816 0.552368 1.983438 1.092890 -0.453562 ... 0.285189 -0.604234 -1.053623 -0.291745 -1.646335 -0.215531 0.008500 1.100297 2.076530 -0.854836
1 -0.187936 0.041684 0.941944 1.898925 0.179125 0.636418 2.050173 0.229349 -1.910368 0.702720 ... -0.533445 -0.371779 -0.401205 0.231492 -1.043176 1.842388 0.329271 0.882017 0.346758 -0.951460
2 1.097298 -0.136058 -0.323606 -1.096158 -0.009371 -0.945267 1.455854 -0.108160 1.141867 -1.407562 ... 2.310153 2.414735 -0.184708 -1.486121 -0.676003 -0.686621 -0.836830 0.972978 2.330389 1.407472
3 0.928934 0.269935 -1.274605 -0.287077 0.279328 -0.320871 0.802277 -0.713909 -1.039250 1.227245 ... 0.020298 0.259960 -0.885320 0.014820 0.268819 -0.432435 1.254164 0.031453 -0.572056 0.539412
4 -0.714467 1.637883 -0.451313 0.409956 0.565926 0.448906 -0.128214 -0.845320 0.433473 -0.416148 ... 0.758863 -1.702709 -0.000005 -0.293631 -0.859405 -0.167067 0.400996 -1.095900 -1.603850 -1.532443

5 rows × 22 columns

Another example of the usage of ‘drop_used_columns’ and ‘drop_unused_columns’:
  • in the first case (drop_used_columns = True) : only the SVD columns are retrieved
  • in the second case (drop_used_columns = False) : I retrieve the original columns AND the SVD columns

[ ]:

Transformers

[1]:
import warnings
warnings.filterwarnings('ignore')
[2]:
from aikit.datasets.datasets import load_dataset,DatasetEnum
Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)
[3]:
Xtrain.head(20)
[3]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
5 3 Waelens, Mr. Achille male 22.0 0 0 345767 9.0000 NaN S NaN NaN Antwerp, Belgium / Stanton, OH
6 3 Reed, Mr. James George male NaN 0 0 362316 7.2500 NaN S NaN NaN NaN
7 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S 8 NaN Brooklyn, NY
8 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 1 0 13695 60.0000 C31 S 6 NaN Huntington, WV
9 1 Rowe, Mr. Alfred G male 33.0 0 0 113790 26.5500 NaN S NaN 109.0 London
10 3 Meo, Mr. Alfonzo male 55.5 0 0 A.5. 11206 8.0500 NaN S NaN 201.0 NaN
11 3 Abbott, Mr. Rossmore Edward male 16.0 1 1 C.A. 2673 20.2500 NaN S NaN 190.0 East Providence, RI
12 3 Elias, Mr. Dibo male NaN 0 0 2674 7.2250 NaN C NaN NaN NaN
13 2 Reynaldo, Ms. Encarnacion female 28.0 0 0 230434 13.0000 NaN S 9 NaN Spain
14 3 Khalil, Mr. Betros male NaN 1 0 2660 14.4542 NaN C NaN NaN NaN
15 1 Daniels, Miss. Sarah female 33.0 0 0 113781 151.5500 NaN S 8 NaN NaN
16 3 Ford, Miss. Robina Maggie 'Ruby' female 9.0 2 2 W./C. 6608 34.3750 NaN S NaN NaN Rotherfield, Sussex, England Essex Co, MA
17 3 Thorneycroft, Mrs. Percival (Florence Kate White) female NaN 1 0 376564 16.1000 NaN S 10 NaN NaN
18 3 Lennon, Mr. Denis male NaN 1 0 370371 15.5000 NaN Q NaN NaN NaN
19 3 de Pelsmaeker, Mr. Alfons male 16.0 0 0 345778 9.5000 NaN S NaN NaN NaN
[4]:
columns_block = {"CAT":["sex","embarked","cabin"],
                 "NUM":["pclass","age","sibsp","parch","fare"],
                 "TEXT":["name","ticket"]}
columns_block
[4]:
{'CAT': ['sex', 'embarked', 'cabin'],
 'NUM': ['pclass', 'age', 'sibsp', 'parch', 'fare'],
 'TEXT': ['name', 'ticket']}

Numerical Encoder

[5]:
from aikit.transformers import NumericalEncoder

encoder = NumericalEncoder(columns_to_use=columns_block["CAT"] + columns_block["NUM"])
Xtrain_cat = encoder.fit_transform(Xtrain)
[6]:
Xtrain_cat.head()
[6]:
name ticket boat body home_dest sex__male sex__female embarked__S embarked__C embarked__Q ... fare__7.925 fare__7.225 fare__7.25 fare__8.6625 fare__0.0 fare__69.55 fare__15.5 fare__7.8542 fare__21.0 fare____default__
0 McCarthy, Mr. Timothy J 17463 NaN 175.0 Dorchester, MA 1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 Fortune, Mr. Mark 19950 NaN NaN Winnipeg, MB 1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 1
2 Sagesser, Mlle. Emma PC 17477 9 NaN NaN 0 1 0 1 0 ... 0 0 0 0 0 0 0 0 0 1
3 Panula, Master. Urho Abraham 3101295 NaN NaN NaN 1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 Maioni, Miss. Roberta 110152 8 NaN NaN 0 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 80 columns

[7]:
list(Xtrain_cat.columns)
[7]:
['name',
 'ticket',
 'boat',
 'body',
 'home_dest',
 'sex__male',
 'sex__female',
 'embarked__S',
 'embarked__C',
 'embarked__Q',
 'cabin____null__',
 'cabin____default__',
 'pclass__3',
 'pclass__1',
 'pclass__2',
 'age____null__',
 'age__24.0',
 'age__18.0',
 'age__22.0',
 'age__21.0',
 'age__30.0',
 'age__36.0',
 'age__28.0',
 'age__19.0',
 'age__27.0',
 'age__25.0',
 'age__29.0',
 'age__23.0',
 'age__31.0',
 'age__26.0',
 'age__35.0',
 'age__32.0',
 'age__33.0',
 'age__39.0',
 'age__17.0',
 'age__42.0',
 'age__45.0',
 'age__16.0',
 'age__20.0',
 'age__50.0',
 'age__40.0',
 'age__34.0',
 'age__38.0',
 'age__1.0',
 'age__47.0',
 'age____default__',
 'sibsp__0',
 'sibsp__1',
 'sibsp__2',
 'sibsp__4',
 'sibsp__3',
 'sibsp__8',
 'sibsp__5',
 'parch__0',
 'parch__1',
 'parch__2',
 'parch__5',
 'parch__4',
 'parch__3',
 'parch__9',
 'parch__6',
 'fare__13.0',
 'fare__7.75',
 'fare__8.05',
 'fare__7.8958',
 'fare__26.0',
 'fare__10.5',
 'fare__7.775',
 'fare__7.2292',
 'fare__26.55',
 'fare__7.925',
 'fare__7.225',
 'fare__7.25',
 'fare__8.6625',
 'fare__0.0',
 'fare__69.55',
 'fare__15.5',
 'fare__7.8542',
 'fare__21.0',
 'fare____default__']

Target Encoder

[8]:
from aikit.transformers import TargetEncoderClassifier

encoder = TargetEncoderClassifier(columns_to_use=columns_block["CAT"] + columns_block["NUM"])
Xtrain_cat = encoder.fit_transform(Xtrain, y_train)

Xtrain_cat.head()
[8]:
name ticket boat body home_dest sex__target_1 embarked__target_1 cabin__target_1 pclass__target_1 age__target_1 sibsp__target_1 parch__target_1 fare__target_1
0 McCarthy, Mr. Timothy J 17463 NaN 175.0 Dorchester, MA 0.184564 0.341945 0.195652 0.612766 0.547457 0.353941 0.328632 0.185878
1 Fortune, Mr. Mark 19950 NaN NaN Winnipeg, MB 0.184564 0.341945 0.597354 0.612766 0.453744 0.524017 0.276773 0.597354
2 Sagesser, Mlle. Emma PC 17477 9 NaN NaN 0.746398 0.576720 0.695652 0.612766 0.481185 0.353941 0.328632 0.695652
3 Panula, Master. Urho Abraham 3101295 NaN NaN NaN 0.184564 0.341945 0.314483 0.264822 0.463933 0.267929 0.653542 0.166522
4 Maioni, Miss. Roberta 110152 8 NaN NaN 0.746398 0.341945 0.391304 0.612766 0.427970 0.353941 0.328632 0.710857

CountVectorizer

[9]:
from aikit.transformers import CountVectorizerWrapper

encoder = CountVectorizerWrapper(analyzer="char",columns_to_use=columns_block["TEXT"])

Xtrain_enc = encoder.fit_transform(Xtrain)
Xtrain_enc
[9]:
<1048x62 sparse matrix of type '<class 'numpy.int32'>'
        with 21936 stored elements in COOrdinate format>
[10]:
encoder.get_feature_names()
[10]:
['name__BAG__ ',
 "name__BAG__'",
 'name__BAG__(',
 'name__BAG__)',
 'name__BAG__,',
 'name__BAG__-',
 'name__BAG__.',
 'name__BAG__/',
 'name__BAG__a',
 'name__BAG__b',
 'name__BAG__c',
 'name__BAG__d',
 'name__BAG__e',
 'name__BAG__f',
 'name__BAG__g',
 'name__BAG__h',
 'name__BAG__i',
 'name__BAG__j',
 'name__BAG__k',
 'name__BAG__l',
 'name__BAG__m',
 'name__BAG__n',
 'name__BAG__o',
 'name__BAG__p',
 'name__BAG__q',
 'name__BAG__r',
 'name__BAG__s',
 'name__BAG__t',
 'name__BAG__u',
 'name__BAG__v',
 'name__BAG__w',
 'name__BAG__x',
 'name__BAG__y',
 'name__BAG__z',
 'ticket__BAG__ ',
 'ticket__BAG__.',
 'ticket__BAG__/',
 'ticket__BAG__0',
 'ticket__BAG__1',
 'ticket__BAG__2',
 'ticket__BAG__3',
 'ticket__BAG__4',
 'ticket__BAG__5',
 'ticket__BAG__6',
 'ticket__BAG__7',
 'ticket__BAG__8',
 'ticket__BAG__9',
 'ticket__BAG__a',
 'ticket__BAG__c',
 'ticket__BAG__e',
 'ticket__BAG__f',
 'ticket__BAG__h',
 'ticket__BAG__i',
 'ticket__BAG__l',
 'ticket__BAG__n',
 'ticket__BAG__o',
 'ticket__BAG__p',
 'ticket__BAG__q',
 'ticket__BAG__r',
 'ticket__BAG__s',
 'ticket__BAG__t',
 'ticket__BAG__w']

Truncated SVD

[11]:
from aikit.transformers import TruncatedSVDWrapper
svd = TruncatedSVDWrapper(n_components=0.1)

xx_train_small_svd = svd.fit_transform(Xtrain_enc)
xx_train_small_svd
[11]:
SVD__0 SVD__1 SVD__2 SVD__3 SVD__4 SVD__5
0 5.087193 -0.825855 -1.777827 1.079631 1.131004 -1.943098
1 4.329638 -1.584227 -1.592165 0.356413 0.598414 -0.024379
2 5.899164 -0.197573 2.482450 -0.058293 -1.766613 -0.709567
3 7.419224 0.221492 -2.534401 2.470868 -1.714951 0.337166
4 5.771280 1.272598 -0.836011 0.126549 1.036792 -0.689974
... ... ... ... ... ... ...
1043 7.751914 -0.199622 0.934052 -0.541200 -1.632366 -0.380562
1044 7.063821 -1.730135 -1.130117 -2.086049 0.066611 -0.866250
1045 5.595593 1.029825 1.534740 1.149602 1.487406 1.612092
1046 4.625180 -1.261425 -0.873728 -0.566604 1.032036 0.202452
1047 5.020200 1.633941 -2.131337 -0.567731 0.226184 -1.410670

1048 rows × 6 columns

Column selector

[12]:
from aikit.transformers import ColumnsSelector
selector = ColumnsSelector(columns_to_use=columns_block["TEXT"])
Xtrain_subset = selector.fit_transform(Xtrain)
Xtrain_subset.head(10)
[12]:
name ticket
0 McCarthy, Mr. Timothy J 17463
1 Fortune, Mr. Mark 19950
2 Sagesser, Mlle. Emma PC 17477
3 Panula, Master. Urho Abraham 3101295
4 Maioni, Miss. Roberta 110152
5 Waelens, Mr. Achille 345767
6 Reed, Mr. James George 362316
7 Swift, Mrs. Frederick Joel (Margaret Welles Ba... 17466
8 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) 13695
9 Rowe, Mr. Alfred G 113790
[ ]:

Stacking

This notebook will show you how to stack models using aikit

Regression with OutSamplerTransformer

Let’s start by creating a simple Regression dataset

[1]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=30, n_informative=10, n_targets=1, random_state=123)

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=123)
[2]:
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge

from aikit.pipeline import GraphPipeline
from aikit.models import OutSamplerTransformer

cv = 10

stacking_model = GraphPipeline(models = {
    "rf"   : OutSamplerTransformer(RandomForestRegressor(random_state=123, n_estimators=10) , cv = cv),
    "lgbm" : OutSamplerTransformer(LGBMRegressor(random_state=123, n_estimators=10)         ,  cv = cv),
    "ridge": OutSamplerTransformer(Ridge(random_state=123)     , cv = cv),
    "blender":Ridge()
    }, edges = [("rf","blender"),("lgbm","blender"),("ridge","blender")])


stacking_model.graphviz
Matplotlib won't work
C:\HOMEWARE\Anaconda3-Windows-x86_64\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
[2]:
_images/notebooks_Stacking_3_2.svg

This model behaves like a regular sklearn regressor.

It can be fitted :

[3]:
stacking_model.fit(Xtrain, ytrain)
[3]:
GraphPipeline(edges=[('rf', 'blender'), ('lgbm', 'blender'),
                     ('ridge', 'blender')],
              models={'blender': Ridge(alpha=1.0, copy_X=True,
                                       fit_intercept=True, max_iter=None,
                                       normalize=False, random_state=None,
                                       solver='auto', tol=0.001),
                      'lgbm': OutSamplerTransformer(columns_prefix=None, cv=10,
                                                    desired_output_type=None,
                                                    model=LGBMRegressor(boosting_type='gbdt',
                                                                        class_weight=...
                                                                              n_jobs=None,
                                                                              oob_score=False,
                                                                              random_state=123,
                                                                              verbose=0,
                                                                              warm_start=False),
                                                  random_state=123),
                      'ridge': OutSamplerTransformer(columns_prefix=None, cv=10,
                                                     desired_output_type=None,
                                                     model=Ridge(alpha=1.0,
                                                                 copy_X=True,
                                                                 fit_intercept=True,
                                                                 max_iter=None,
                                                                 normalize=False,
                                                                 random_state=123,
                                                                 solver='auto',
                                                                 tol=0.001),
                                                     random_state=123)},
              no_concat_nodes=None, verbose=False)

You can predict

[4]:
yhat_test = stacking_model.predict(Xtest)
[5]:
from sklearn.metrics import mean_squared_error
10_000 * mean_squared_error(ytest, yhat_test)
[5]:
9.978737586369565

Let’s describe what goes on during the fit:

  1. cross_val_predict is called on each of the models => this creates out-of-sample predictions for each observation
  2. each model is then re-fitted on the full data => to be ready when called for a new prediction
  3. the blender is given the out-of-sample predictions of the 3 models to fit a final model

The ‘OutSamplerTransformer’ object implements the logic to create out-of-sample predictions, whereas the GraphPipeline passes the transformations from one node to the next(s).
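
To make the mechanics concrete, here is a minimal sketch of that logic written directly with sklearn (this is not aikit’s internal code; it only mirrors the three steps above with two models and a Ridge blender):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=1000, n_features=30, n_informative=10, random_state=123)

models = {"rf": RandomForestRegressor(n_estimators=10, random_state=123),
          "ridge": Ridge(random_state=123)}

# 1. out-of-sample predictions for each model (what OutSamplerTransformer creates)
out_of_sample = np.column_stack([cross_val_predict(m, X, y, cv=10) for m in models.values()])

# 2. re-fit each model on the full data so it is ready for new observations
for m in models.values():
    m.fit(X, y)

# 3. fit the blender on the out-of-sample predictions
blender = Ridge().fit(out_of_sample, y)

# at prediction time : each model predicts, then the blender combines those predictions
new_predictions = np.column_stack([m.predict(X[:5]) for m in models.values()])
print(blender.predict(new_predictions))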

With that logic you can do more complex things

Let’s say you have missing values to fill before feeding the data to the models (Remark : this is not the case here).

You can just add a node at the top of the pipeline:

[6]:
from aikit.transformers import NumImputer

stacking_model = GraphPipeline(models = {
    "imp"  : NumImputer(),
    "rf"   : OutSamplerTransformer(RandomForestRegressor(random_state=123, n_estimators=10) , cv = cv),
    "lgbm" : OutSamplerTransformer(LGBMRegressor(random_state=123, n_estimators=10)         ,  cv = cv),
    "ridge": OutSamplerTransformer(Ridge(random_state=123)     , cv = cv),
    "blender":Ridge()
    }, edges = [("imp", "rf","blender"),("imp", "lgbm","blender"),("imp", "ridge","blender")])

stacking_model.graphviz

[6]:
_images/notebooks_Stacking_10_0.svg

Let’s say you want to pass to the blender the predictions of the model along with the original features.

You can just add another edge :

[7]:
from aikit.transformers import NumImputer

stacking_model = GraphPipeline(models = {
    "imp"  : NumImputer(),
    "rf"   : OutSamplerTransformer(RandomForestRegressor(random_state=123, n_estimators=10) , cv = cv),
    "lgbm" : OutSamplerTransformer(LGBMRegressor(random_state=123, n_estimators=10)         ,  cv = cv),
    "ridge": OutSamplerTransformer(Ridge(random_state=123)     , cv = cv),
    "blender":Ridge()
    }, edges = [("imp", "rf","blender"),("imp", "lgbm","blender"),("imp", "ridge","blender"), ("imp", "blender")])

stacking_model.graphviz

[7]:
_images/notebooks_Stacking_12_0.svg

Example on a Classification task

[8]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)
Xtrain.head(10)
[8]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
5 3 Waelens, Mr. Achille male 22.0 0 0 345767 9.0000 NaN S NaN NaN Antwerp, Belgium / Stanton, OH
6 3 Reed, Mr. James George male NaN 0 0 362316 7.2500 NaN S NaN NaN NaN
7 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S 8 NaN Brooklyn, NY
8 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 1 0 13695 60.0000 C31 S 6 NaN Huntington, WV
9 1 Rowe, Mr. Alfred G male 33.0 0 0 113790 26.5500 NaN S NaN 109.0 London
You can also stack models that work on different parts of the data, for example :
  • a CountVectorizer + Logit model that works on “text-like” columns
  • along with a NumericalEncoder + RandomForestClassifier for the other columns
[9]:
from aikit.transformers import CountVectorizerWrapper, NumericalEncoder, NumImputer, ColumnsSelector
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from aikit.models import OutSamplerTransformer
from aikit.pipeline import GraphPipeline

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(10, random_state=123, shuffle=True)


text_cols     = ["name","ticket"]
non_text_cols = [c for c in Xtrain.columns if c not in text_cols]

gpipeline = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "rf":OutSamplerTransformer(RandomForestClassifier(n_estimators=100, random_state=123),cv=cv),
    "logit":OutSamplerTransformer(LogisticRegression(random_state=123),cv=cv),
    "blender":LogisticRegression(random_state=123)
},
              edges = [("sel","enc","imp","rf", "blender"),("vect","logit","blender")])

gpipeline.fit(Xtrain,y_train)
gpipeline.graphviz

C:\HOMEWARE\Anaconda3-Windows-x86_64\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
[9]:
_images/notebooks_Stacking_16_1.svg

Probabilities calibration (Platt’s scaling)

Using that object you can also re-calibrate your probabilities. This can be done using a method called ‘Platt’s scaling’ (see https://en.wikipedia.org/wiki/Platt_scaling).

It consists in feeding the probabilities of one model to a Logistic Regression which will re-scale them.

[10]:
rf_rescaled = GraphPipeline(models = {
    "sel"  : ColumnsSelector(columns_to_use=non_text_cols),
    "enc"  : NumericalEncoder(),
    "imp"  : NumImputer(),
    "rf"   : OutSamplerTransformer( RandomForestClassifier(class_weight = "auto"), cv = cv),
    "scaling":LogisticRegression()
    }, edges = [('sel','enc','imp','rf','scaling')]
)
rf_rescaled.graphviz
[10]:
_images/notebooks_Stacking_18_0.svg

GraphPipeline

The GraphPipeline object is an extension of sklearn.pipeline.Pipeline but the transformers/models can be chained with any directed graph.

The object takes two arguments as input:
  • models : dictionary of models (each key is the name of a given node, and each corresponding value is the transformer corresponding to that node)
  • edges : list of tuples that link the nodes to each other
The created object will behave like a regular sklearn model :
  • it can be cloned and thus cross-validated
  • it can be pickled
  • set_params and get_params work
  • it can obviously be fitted and return predictions
(a short sketch of this sklearn-like behaviour is given after the example below)

See full example here :

Example:

gpipeline = GraphPipeline(models = {"vect" : CountVectorizerWrapper(analyzer="char",ngram_range=(1,4)),
                                        "svd"  : TruncatedSVDWrapper(n_components=400) ,
                                        "logit" : LogisticRegression(class_weight="balanced")},
                               edges = [("vect","svd","logit")]
                               )
This is a model on which we do the following steps:
  1. do a Bag of Char CountVectorizer
  2. apply an SVD on the result
  3. fit a Logistic Regression on the result

The edges parameter just says that there are edges from “vect” to “svd” to “logit”. By convention, if a tuple is of length more than 2, it is assumed that all consecutive elements form an edge (same convention as in the DOT graph language)

This model could have been built using a sklearn Pipeline. If graphviz is installed you can get a nice visualization of the graph using:

gpipeline.graphviz
graphviz vizualization of a simple pipeline
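
Since the created object is a regular sklearn estimator, the usual ecosystem tools apply to it. Here is a minimal sketch reusing the gpipeline defined above (dfX and y are assumed to hold a DataFrame of text and the corresponding target; they are not defined in this snippet):

import pickle
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

params = gpipeline.get_params()       # get_params / set_params work as usual
gpipeline_copy = clone(gpipeline)     # it can be cloned ...
# scores = cross_val_score(gpipeline, dfX, y, cv=5)   # ... and therefore cross-validated
dumped = pickle.dumps(gpipeline)      # it can be pickled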

Now the aim of the GraphPipeline object is to be able to handle more complex chaining of transformers. Let’s assume that you have 2 text columns in your dataset, “Text1” and “Text2”, and 2 categorical columns, “Cat1” and “Cat2”, and that you want to do the following:

  • CountVectorizer on your text
  • Categorical Encoder on your categories
  • merging the 2 encoded variables and give the result to a RandomForestClassifier

You would need to create a GraphPipeline like that:

gpipeline = GraphPipeline(models = {"vect" : CountVectorizerWrapper(analyzer="char",ngram_range=(1,4), columns_to_use=["text1","text2"]),
                                    "cat"  : NumericalEncoder(columns_to_use=["cat1","cat2"]) ,
                                    "rf"   : RandomForestClassifier(n_estimators=100)}  ,
                               edges = [("vect","rf"),("cat","rf")]
                               )

Note that this time, the edges parameter says that there are 2 edges : “vect” to “rf” and “cat” to “rf”

graphviz vizualization of a merging pipeline

This particular graph could have been built using a combination of sklearn.pipeline.Pipeline and sklearn.pipeline.FeatureUnion, however the syntax is easier here.

Let’s take a slightly more complex example:

gpipeline = GraphPipeline(models = {"encoder" : NumericalEncoder(columns_to_use = ["cat1","cat2"]),
                                    "imputer" : NumImputer(),
                                    "vect"    : CountVectorizerWrapper(analyzer="word",columns_to_use=["cat1","cat2"]),
                                    "svd"     : TruncatedSVDWrapper(n_components=50),
                                    "rf"      : RandomForestClassifier(n_estimators=100)
                                        },
                        edges = [("encoder","imputer","rf"),("vect","svd","rf")] )
graphviz vizualization of a more complexe pipeline
In that example we have 2 chains of transformers :
  • one for the text, on which we apply a CountVectorizer and then an SVD
  • one for the categorical variables, which we encode and for which we then impute the missing values

Now let’s take one last example of something that couldn’t be done using FeatureUnion and Pipeline without computing some transformations twice.

Imagine that you want the RandomForestClassifier to have both the CountVectorizer features and the SVD features. Using the GraphPipeline you can do that by just adding another edge between “vect” and “rf”:

gpipeline = GraphPipeline(models = {"encoder" : NumericalEncoder(columns_to_use = ["cat1","cat2"]),
                                    "imputer" : NumImputer(),
                                    "vect"    : CountVectorizerWrapper(analyzer="word",columns_to_use=["cat1","cat2"]),
                                    "svd"     : TruncatedSVDWrapper(n_components=50),
                                    "rf"      : RandomForestClassifier(n_estimators=100)
                                        },
                        edges = [("encoder","imputer","rf"),("vect","rf"),("vect","svd","rf")] )
graphviz vizualization of a more complexe pipeline
A variation of that would be to include a node to select features after the CountVectorizer; that way you could :
  • keep the most important text features as is
  • retain some of the other information but reduce the dimension via an SVD
  • keep the other features
  • feed everything to a given model

So the code would be:

gpipeline = GraphPipeline(models = {"encoder" : NumericalEncoder(columns_to_use = ["cat1","cat2"]),
                                    "imputer" : NumImputer(),
                                    "vect"    : CountVectorizerWrapper(analyzer="word",columns_to_use=["cat1","cat2"]),
                                    "sel"     : FeaturesSelectorClassifier(n_components = 25),
                                    "svd"     : TruncatedSVDWrapper(n_components=50),
                                    "rf"      : RandomForestClassifier(n_estimators=100)
                                        },
                        edges = [("encoder","imputer","rf"),("vect","sel","rf"),("vect","svd","rf")] )
graphviz vizualization of a more complexe pipeline

Merging nodes

The GraphPipeline automatically knows how to merge data of different types. For example the output of a CountVectorizer is most of the time a sparse matrix whereas the output of an encoder usually is a DataFrame or a numpy array. This is the case for the pipeline

graphviz vizualization of a merging pipeline

before applying the RandomForest a concatenation should be made. The GraphPipeline uses the generic concatenation function aikit.data_structure_helper.generic_hstack() to handle that.

Remark : in some cases you don’t want the model to handle the concatenation, for example because you want to treat the two (or more) inputs separately yourself in the following transformer. In that case you can specify the argument no_concat_nodes to indicate the nodes at which you don’t want the GraphPipeline to concatenate. The following node will then receive a dictionary of transformed data whose keys are the names of the preceding nodes and whose values are the corresponding transformations.
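
As an illustration of this remark, here is a hedged sketch (not taken from aikit’s examples) where a final custom node is listed in no_concat_nodes and therefore receives a dictionary keyed by the names of the preceding nodes:

from sklearn.base import BaseEstimator, TransformerMixin
from aikit.pipeline import GraphPipeline
from aikit.transformers import CountVectorizerWrapper, NumericalEncoder

class InspectInputs(BaseEstimator, TransformerMixin):
    """ hypothetical node : prints what it receives and passes it through """
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # per the remark above, X should be a dict {name of preceding node: transformed data}
        for name, data in X.items():
            print(name, getattr(data, "shape", None))
        return X

gpipeline = GraphPipeline(models={"cat": NumericalEncoder(columns_to_use=["cat1", "cat2"]),
                                  "vect": CountVectorizerWrapper(analyzer="word", columns_to_use=["text1", "text2"]),
                                  "inspect": InspectInputs()},
                          edges=[("cat", "inspect"), ("vect", "inspect")],
                          no_concat_nodes=["inspect"])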

Remark : the order in which the nodes appear in the graphviz representation is not necessarily the order of the concatenation of the features made within the GraphPipeline. The order of concatenation is actually fixed by the order of the edges : the first node that appears within the edges will be on the left of the concatenated data.

Exposed methods

The GraphPipeline exposes the methods that are available on the final node of the estimator (Pipeline does exactly the same). For example
  • if the final estimator has a decision_function method, the GraphPipeline will be able to call it (applying the transformations of each node and then decision_function on the final node)
  • if that is not the case an AttributeError will be raised
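
For example, with the first gpipeline above (whose final node “logit” is a LogisticRegression), once it has been fitted on some data dfX, y (not defined in this snippet):

probas = gpipeline.predict_proba(dfX)        # transformations applied node by node, then predict_proba on "logit"
scores = gpipeline.decision_function(dfX)    # same mechanism for decision_function
# with a final node that lacks these methods, the same calls raise an AttributeError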

Features Names

The GraphPipeline also makes it easier to handle feature names. Indeed, when using it, the names of the features are passed down the pipeline using the “get_feature_names” method of each node and, whenever possible, using the “input_features” argument to tell each node the names of the features it receives. That way each node knows the names of its input features and can give the names of its output features. Those names are used as column names whenever a DataFrame is retrieved.

You can also retrieve the features at each node if needed (for example to look at feature importances on the last node). The two methods to do that are:

  • get_feature_names_at_node : retrieve the names of the features at the exit of a given node
  • get_input_features_at_node : retrieve the names of the features at the entry of a given node (this can be called on the last node to know which feature importance corresponds to which feature, for example)
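
A hedged sketch of how these can be used, assuming one of the fitted gpipeline examples above whose last node is “rf” (accessing the fitted node through the models attribute is an assumption of this sketch):

input_features = gpipeline.get_input_features_at_node("rf")   # names of the features entering the "rf" node
importances = gpipeline.models["rf"].feature_importances_     # assumption : the fitted RandomForest is stored in `models`

# top 10 features by importance
for name, importance in sorted(zip(input_features, importances), key=lambda t: -t[1])[:10]:
    print(name, importance)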

Complete Documentation

Auto ML Overview

Auto ML Launcher

To help launch the ml machine jobs you can use the MlMachineLauncher object.

It :

  • contains the configurations of the auto-ml
  • has methods to launch a controller, a worker, … (See after)
  • has a method to process command line arguments to quickly create a script that can be used to drive the ml-process (See after)

The easiest way is to create a script like the following one.

Example:

from aikit.datasets import load_dataset, DatasetEnum
from aikit.ml_machine import MlMachineLauncher

def loader():
    """ modify this function to load the data

    Returns
    -------
    dfX, y

    Or
    dfX, y, groups

    """
    dfX, y, *_ = load_dataset(DatasetEnum.titanic)
    return dfX, y

def set_configs(launcher):
    """ modify that function to change launcher configuration """

    launcher.job_config.score_base_line = 0.75
    launcher.job_config.allow_approx_cv = True

    return launcher

if __name__ == "__main__":
    launcher = MlMachineLauncher(base_folder = "C:/automl/titanic",
                                 name = "titanic",
                                 loader = loader,
                                 set_configs = set_configs)

    launcher.execute_processed_command_argument()

(in what follows we will assume that this is the content of the “automl_launcher.py” file)

Here is what is going on:

  1. first import the launcher
  2. define a loader function : it is better to define a loading function that can be called if needed instead of just loading the data (because you don’t always need the data)
  3. create a launcher, with a base folder and the loading function

  4. (optional) change a few things in the configuration : here we set the base line to 75% and tell the auto-ml that it can do approximate cross-validation (see the ‘Auto Ml Advanced functionalities’ section). To do that, pass a function that changes the configuration.
  5. process the command arguments to actually start a command

Remarks:

  • if no changes to the default configuration are needed you can use set_configs = None
  • the loading function can also return 3 things : dfX, y and groups (if a group cross-validation is needed)
  • don’t forget the ‘if __name__ == “__main__”’ part, since the code uses subprocess it is really needed
Having created a script like that you can now use the script to drive the auto-ml process :
  • to start a controller and n workers
  • to aggregate the result
  • to separately start a controller
  • to separately start worker(s)
  • to fit a specific model

What do you need to specify ?

For the automl to work you need to specify a few things:
  • a loader function

This function will load your data. It should return a DataFrame with the features (dfX), the target (y), and optionally the groups (if you want to use a grouped CV). It will be called only once, during the initialisation phase, so if you’re loading data you don’t need to save it in a shared folder accessible by all the workers (after it is called, the auto-ml will persist everything needed).

  • a base folder : the folder on which the automl will work.

This folder should be accessible by all the workers and the controller. It will be used to save the results, the queue of jobs, the logs, …

  • set_configs function : a function to modify the settings of the automl

You can modify the cv, the base line, the scoring, … (See ml_machine_launcher_advanced for details).

Summary of the auto ml command

run command

This is the main command, it will start everything that is needed. To start the whole process you should use the ‘run’ command; in a command window you can run:

python automl_launcher.py run
It will:
  1. load the data using the loader
  2. initialize everything
  3. modify configuration
  4. save everything needed to disk
  5. start one controller in a subprocess
  6. start one worker

You can also start more than one worker; to do that, the “-n” option should be used:

python automl_launcher.py run -n 4

This will create a total of 4 workers (and also 1 controller), so at the end you’ll have 5 python processes running

manual start

You can also use this script to start everything manually. That way you can
  • do the initialization manually
  • have one console for the controller
  • have separate consoles for workers

To do that you need the same steps as before.

init command

If you only want to initialize everything, you can run the ‘init’ command:

python automl_launcher.py init

This won’t start anything (no worker, no controller), but it will load the data, prepare the configuration, apply the changes and persist everything to disk.

manual init

Alternatively you can do that manually in a notebook or your favorite IDE. That way you can actually see what the default configuration is, prepare the data, etc.

Here is the code to do that:

launcher = MlMachineLauncher(base_folder="C:/automl/titanic", loader=loader)
launcher.initialize()
launcher.job_config.score_base_line = 0.75
launcher.auto_ml_config.columns_informations["Pclass"]["TypeOfVariable"] = "TEXT"

# ... here you can take a look at job_config and auto_ml_config
# ... any other change

launcher.persist()

controller command

If you only want to start a controller, you should use the ‘controller’ command:

python automl_launcher.py controller

This will start one controller (in the main process)

worker command

If you only want to start worker(s) you should use the ‘worker’ command:

python automl_launcher.py worker -n 2

This will start 2 workers (one in the main process and one in a subprocess). For them to do anything a controller needs to be started elsewhere. This command is useful to add new workers to an existing task, or to add new workers on another computer (assuming the controller is running elsewhere).

result command

If you want to launch the aggregation of result, you can use the ‘result’ command:

python automl_launcher.py result

This will trigger the result aggregation and generate the Excel result file

stop command

If you want to stop every process, you can use the ‘stop’ command:

python automl_launcher.py stop

It will create the stop file that will trigger the exit of all processes listening to that folder

fit command

If you want to fit one or more specific model(s), you can use the ‘fit’ command. You’ll need to specify the job_id(s) to fit:

python automl_launcher.py fit --job_ids 77648ab95306e564c4c230e8469e9470

Or:

python automl_launcher.py fit --job_ids 77648ab95306e564c4c230e8469e9470,469ee473a55a4d1376d3c3186c95f048

To fit more than one model. The models will be saved within ‘saved_models’ along with their json.

Summary

To start a new experiment, first create the script following the example above, then use the run command.

If you want to split everything you can use

  1. launcher.initialize()
  2. apply modifications
  3. launcher.persist()
  4. controller command
  5. worker command

Whenever you want an aggregation of results : result command

Auto Ml Advanced functionalities

The launcher script can be used to specify lots of things in the ml machine. The auto-ml makes lots of choices by default which can be changed. To change those you need to modify the ‘job_config’ object or the ‘auto_ml_config’ object.

Within the launcher, you can use the ‘set_configs’ function to do just that.

Here is a complete example:

def set_configs(launcher):
    """ this is the function that will set the different configurations """
    # Change the CV here :
    launcher.job_config.cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123) # specify CV

    # Change the scorer to use :
    launcher.job_config.scoring = ['accuracy', 'log_loss_patched', 'avg_roc_auc', 'f1_macro']

    # Change the main scorer (linked with )
    launcher.job_config.main_scorer = 'accuracy'

    # Change the base line (for the main scorer)
    launcher.job_config.score_base_line = 0.8

    # Allow 'approx cv or not :
    launcher.job_config.allow_approx_cv = False

    # Allow 'block search' or not :
    launcher.job_config.do_blocks_search = True

    # Start with default models or not :
    launcher.job_config.start_with_default = True

    # Change default 'columns block' : use for block search
    launcher.auto_ml_config.columns_block = OrderedDict([
          ('NUM', ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']),
         ('TEXT', ['name', 'ticket']),
         ('CAT', ['sex', 'cabin', 'embarked', 'boat', 'home_dest'])])

    # Change the list of models/transformers to use :
    launcher.auto_ml_config.models_to_keep = [
                 #('Model', 'LogisticRegression'),
                 ('Model', 'RandomForestClassifier'),
                 #('Model', 'ExtraTreesClassifier'),

                 # Example : keeping only RandomForestClassifer

                 ('FeatureSelection', 'FeaturesSelectorClassifier'),

                 ('TextEncoder', 'CountVectorizerWrapper'),

                 #('TextPreprocessing', 'TextNltkProcessing'),
                 #('TextPreprocessing', 'TextDefaultProcessing'),
                 #('TextPreprocessing', 'TextDigitAnonymizer'),

                 # => Example: removing TextPreprocessing

                 ('CategoryEncoder', 'NumericalEncoder'),
                 ('CategoryEncoder', 'TargetEncoderClassifier'),

                 ('MissingValueImputer', 'NumImputer'),

                 ('DimensionReduction', 'TruncatedSVDWrapper'),
                 ('DimensionReduction', 'PCAWrapper'),

                 ('TextDimensionReduction', 'TruncatedSVDWrapper'),
                 ('DimensionReduction', 'KMeansTransformer'),
                 ('Scaling', 'CdfScaler')
                 ]

    # Specify the type of problem
    launcher.auto_ml_config.type_of_problem = 'CLASSIFICATION'

    # Specify special hyper parameters : Example
    launcher.auto_ml_config.specific_hyper = {
            ('Model', 'RandomForestClassifier') : {"n_estimators":[10,20]}
            }
    # Example : only test n_estimators to be 10 or 20

    return launcher

Job Config Option

The ‘job_config’ object stores information related to the way we will test the models : like the Cross-Validation object to use, the base-line, …

  1. Cross-Validation

You can change the ‘cv’ to use. Simply change the ‘cv’ attribute.

Remark : if you specify ‘cv = 10’, the type of CV will be guessed (StratifiedKFold, KFold, …)

Remark : you can use a special ‘RandomTrainTestCv’ or ‘IndexTrainTestCv’ object to do only a ‘train/test’ split and not a full cross-validation.

  2. Scorings

Here you can change the scorers to be used. Simply specify a list with the names of the scorers. You can also directly pass sklearn scorers.
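
For example (a short sketch inside the set_configs function; ‘f1_macro’ is one of the names used in the example above):

from sklearn.metrics import make_scorer, f1_score

launcher.job_config.scoring = ["accuracy", "f1_macro"]                     # by name
launcher.job_config.scoring = [make_scorer(f1_score, average="macro")]     # or an sklearn scorer directly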

  3. Main Scorer

This specifies the main scorer; it is used to compute the base-line. It can also be used to ‘guide’ the auto-ml.

  4. Approximate CV

Whether you want to allow that or not. The idea of ‘approximate cv’ is to save time by by-passing the CV of some transformers : if a transformer doesn’t depend on the target you can reasonably skip its cross-validation without having much leakage.

  5. Do Block Search

If True, the auto-ml will do special jobs where the model is fixed and the preprocessing pipeline is the default one, but it tries to remove some of the blocks of columns. It also tries to use only one block. Those jobs help figure out which blocks of features are important and which are not.

  6. Starts with default

If True, the auto-ml starts with ‘default’ models and transformers before doing the full bayesian random search.

Auto Ml Config

The ‘auto_ml_config’ object stores the information related to the data, the problem, …

  1. Columns Blocks

Using that attribute you can change the blocks of columns. By default the blocks correspond to the types of variables (Numerical, Categorical and Text) but you can specify what you want. Those blocks will be used for the ‘block search’ jobs.

  2. Models to Keep

Here you can filter the models/transformers to test.

Remark : you need to keep required preprocessing steps. For example, if you have text columns you need to keep at least one text encoder.

  3. Type of Problem

You can change the type of problem. This is needed if the guess was wrong.

  4. Specific Hyper Parameters

You can change the hyper-parameters used : simply pass a dictionary whose keys are the models to change and whose values are the new hyper-parameters. The new hyper-parameters can either be a dict (as in the example above) or an object of the HyperCrossProduct class.

Usage of groups

Sometimes your data falls into different groups. Sklearn allows you to pass this information to the cross-validation object to make sure the folds respect the groups. Aikit also allows you to use those groups in custom scorers. To use groups in the auto-ml, the ‘loader’ function needs to return three things instead of two : ‘dfX, y, groups’.

You can then specify a special CV or a special scorer that uses the groups.
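
For example, a hedged sketch of a loader returning groups (the file name and column names are hypothetical):

import pandas as pd

def loader():
    """ loads the data and returns dfX, y and the groups """
    df = pd.read_csv("my_data.csv")      # hypothetical data file
    groups = df.pop("client_id")         # one group label per row
    y = df.pop("target")
    return df, y, groups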

Auto ML Manual Launch

Simple Launch

Here are the steps to launch a test. For simplicity you can also use an MlMachineLauncher object (see the Auto ML Launcher section).

Load a dataset ‘dfX’ and its target ‘y’ and decide a ‘name’ for the experiment. Then create an AutoMlConfig object:

from aikit.ml_machine import AutoMlConfig, JobConfig
auto_ml_config = AutoMlConfig(dfX=dfX, y=y, name=name)
auto_ml_config.guess_everything()

This object will contain all the configurations of the problem. Then create a JobConfig object:

job_config = JobConfig()
job_config.guess_cv(auto_ml_config=auto_ml_config, n_splits=10)
job_config.guess_scoring(auto_ml_config=auto_ml_config)

The data (dfX and y) can be deleted from the AutoMlConfig after everything is guessed to save memory:

auto_ml_config.dfX = None
auto_ml_config.y = None

If you have an idea of a base line score you can set it:

job_config.score_base_line = 0.95

Create a DataPersister object (for now only a FolderDataPersister object is available but other persisters using a database should be possible):

from aikit.ml_machine import FolderDataPersister
base_folder = "automl_folder_experiment_1"
data_persister = FolderDataPersister(base_folder=base_folder)

Now everything is ready, a JobController can be created and started (this controller can use an AutoMlModelGuider to help drive the random search):

from aikit.ml_machine import AutoMlResultReader, AutoMlModelGuider, MlJobManager, MlJobRunner
result_reader = AutoMlResultReader(data_persister)

auto_ml_guider = AutoMlModelGuider(result_reader=result_reader,
                                   job_config=job_config,
                                   metric_transformation="default",
                                   avg_metric=True)

job_controller = MlJobManager(auto_ml_config=auto_ml_config,
                            job_config=job_config,
                            auto_ml_guider=auto_ml_guider,
                            data_persister=data_persister)

In the same way, one (or more) workers can be created:

job_runner = MlJobRunner(dfX=dfX,
                   y=y,
                   auto_ml_config=auto_ml_config,
                   job_config=job_config,
                   data_persister=data_persister)

To start the controller just do:

job_controller.run()

To start the worker just do:

job_runner.run()
Remark :
  • the controller and the worker(s) should be started separately (for example the controller in a special thread, or in a special process).
  • the controller doesn’t need the data (dfX and y) : they can be deleted from the AutoMlConfig after everything is guessed
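
For example, a minimal sketch (using the job_controller and job_runner created above) that starts both in daemon threads:

import threading

controller_thread = threading.Thread(target=job_controller.run, daemon=True)
runner_thread = threading.Thread(target=job_runner.run, daemon=True)

controller_thread.start()
runner_thread.start()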

Result Aggregation

After a while (whenever you want actually) you can aggregate the results and look at them. The simplest way to do that is to aggregate everything into an Excel file. Create an AutoMlResultReader:

from aikit.ml_machine import AutoMlResultReader, FolderDataPersister

base_folder = "automl_folder_experiment_1"
data_persister = FolderDataPersister(base_folder = base_folder)

result_reader = AutoMlResultReader(data_persister)

Then retrieve everything:

   df_results = result_reader.load_all_results()
   df_params  = result_reader.load_all_params()
   df_errors  = result_reader.load_all_errors()

* df_results is a DataFrame with all the results (cv scores)
* df_params is a DataFrame with all the parameters of each model
* df_errors is a DataFrame with all the errors that arise when fitting the models

Everything is linked by the ‘job_id’ column and can be merged:

df_merged_result = pd.merge(df_params, df_results, how="inner", on="job_id")
df_merged_error  = pd.merge(df_params, df_errors , how="inner", on="job_id")

And saved into an Excel file:

df_merged_result.to_excel(base_folder + "/result.xlsx", index=False)

Result File

automl result file

Here are the parts of the file:

  • job_id : id of the current model (this id is used everywhere)
  • hasblock_** columns : indicate whether or not a given block of columns was used
  • steps columns : indicate which transformer/model was used for each step (Example : “CategoricalEncoder”:”NumericalEncoder”)
  • hyper-parameter columns : indicate the hyper-parameters for each model/transformer
  • test_** : scoring metrics on testing data (ie: out of fold)
  • train_** : scoring metrics on training data (ie: in fold)
  • time, time_score : time to fit and score each model
  • nb : number of folds (not always the max because sometimes we don’t do the full cross-validation if performance is not good)
automl result file
automl result file

aikit proposes a tool to automatically search among machine learning models and preprocessings to find the best one(s).

To do that the algorithm needs an ‘X’ DataFrame and a target ‘y’ that is to be predicted. The algorithm starts by guessing everything that is needed:

  • the type of problem (regression, classification)
  • the type of each variable (categorical, text or numerical)
  • the models/transformers to use
  • the scorer to use
  • the type of cross-validation to use

Everything can be overridden by the user if needed (see Auto Ml Advanced functionalities).

A folder also needs to be set because everything will go through disk and be saved in that folder.

Once everything is set a job controller should be launched. Its job will be to create new models to try. Then, one (or more) workers should be launched to actually do the job and test the model.

After a while the result can be seen and a model can be chosen.

The process is more or less the following (see below for details):

  1. the controller creates a random model (see details below)
  2. one worker picks up that model and cross-validates it
  3. the controller picks up the result to help drive the random search
  4. after a while the result can be aggregated to choose the model

Everything can be driven via a script.

The easiest way to start the auto ml is to create a script like the following one (and save it as ‘automl_launcher.py’ for example)

Example:

from aikit.datasets import load_dataset, DatasetEnum
from aikit.ml_machine import MlMachineLauncher

def loader():
    """ modify this function to load the data

    Returns
    -------
    dfX, y

    Or
    dfX, y, groups

    """
    dfX, y, *_ = load_dataset(DatasetEnum.titanic)
    return dfX, y

def set_configs(launcher):
    """ modify that function to change launcher configuration """

    launcher.job_config.score_base_line = 0.75
    launcher.job_config.allow_approx_cv = True

    return launcher

if __name__ == "__main__":
    launcher = MlMachineLauncher(base_folder = "C:/automl/titanic",
                                 name = "titanic",
                                 loader = loader,
                                 set_configs = set_configs)

    launcher.execute_processed_command_argument()

The only thing to do is to replace the loader function by a function that loads your own data. Once that is done, just run the following command:

python automl_launcher.py run -n 4

It will start the process using 4 workers (you can change that number depending on how many processes you have available).

Here is a diagram that summarizes what is going on and explains the different functionalities. For a complete explanation of all the commands please look at the Auto ML Launcher page.

Summary of the auto ml command

For a detailed explanation about how the auto-ml is working you can go here:

Auto ML Inner Workings

Here is a more detailed explanation about what the Ml Machine is doing.

Steps

Transformations are grouped into steps. Several steps may be needed for each DataFrame, and the needed steps depend on the type of problem and of variables. Some steps are optional, some are required.

Example of such steps:

  • TextPreprocessing (optional) : this step contains all the text preprocessing transformers
  • TextEncoder (needed) : this step encodes the text into numerical values (for example using CountVectorizer or Word2Vec)
  • MissingValueImputer (needed if some variables have missing values)
  • CategorieEncoder (needed if some variables are categorical)
  • Model (needed) : last step, consisting of the prediction model

See the complete list in aikit.enums.StepCategories.

The Ml Machine will randomly draw one model per step and merge them into a complex processing pipeline. Optional steps are sometimes drawn and sometimes not.

(The transformers that are drawn are the ones in the ml machine registry : Model Register)

First Rounds

Before randomly selecting pipelines, a first round of models is tested using:
  • default parameters
  • none of the optional steps

Usually those pipelines perform relatively well and give a good idea of what works and what doesn't.

Next Rounds

Once that is done, random rounds are started. For those, random models are drawn:

  • for each step, draw a random transformation (or skip the step if it is optional)
  • for each model, draw random hyper-parameters
  • if blocks of variables were set, randomly draw a subset of those blocks
  • merge everything into a complex graph

That model is then sent to a worker to be cross-validated.

Stopping Threshold

For each given model the worker aims at doing a full cross-validation. However the cross-validation can be stopped after the first fold if the results are too low (below a threshold fixed by the controller).

That threshold is computed using (see the sketch below) :
  • the base line score if it exists
  • a quantile of the results already obtained

(See aikit.cross_validation.cross_validation() which is used to compute the cross-validation)
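The exact rule is internal to the controller, but conceptually the threshold could look like the following sketch (the quantile level and the way the base line is combined are illustrative assumptions):

import numpy as np

def stopping_threshold(scores_so_far, score_base_line=None, quantile=0.5):
    """ illustrative sketch : a model whose first-fold score falls below this value
    is not worth a full cross-validation """
    threshold = None
    if len(scores_so_far) > 0:
        threshold = np.quantile(scores_so_far, quantile)
    if score_base_line is not None:
        threshold = score_base_line if threshold is None else max(threshold, score_base_line)
    return threshold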

Guided Job

After a few models, with a given random probability, the controller will start to create Guided Jobs. Those jobs are not random anymore but use Bayesian optimization to try to guess a model that will perform well.

Concretely, a meta model is fitted to predict performance based on the hyper-parameters and transformer/model choices. That meta model is then used to predict whether or not a candidate model is likely to perform well.

Random Model Generator

The random model generator can be used outside of the Ml Machine:

from aikit.ml_machine.ml_machine import RandomModelGenerator
generator = RandomModelGenerator( auto_ml_config = auto_ml_config)

Graph, all_models_params, block_to_use = generator.draw_random_graph()

The generator returns three things:
  1. Graph : networkx graph of the model
  2. all_models_params : dictionary with all the hyper-parameters of all the transformers/models
  3. block_to_use : the blocks of columns to use

With these 3 objects the json of a model can be created:

from aikit.ml_machine.model_graph import convert_graph_to_code
model_json_code = convert_graph_to_code(Graph, all_models_params)

And then a working model can be created:

from aikit.model_definition import sklearn_model_from_param
skmodel = sklearn_model_from_param(model_json_code)

You can download and adapt the scripts here

Transformer

aikit offers some transformers to help process the data. Here is a brief description of them.

Some of those transformers are relatively thin wrappers around what exists in sklearn, some are existing techniques packaged as transformers, and some are things built from scratch.

aikit transformers are built using a Model Wrapper

Here is a more detailed explanation about the aikit.transformers.model_wrapper.ModelWrapper class. It explains what the wrapper does and how to wrap new models. It also explains some functionalities common to all the transformers in aikit.

Wrapper Goal

The aim of the wrapper is to provide a generic class that handles most of the redundant operations we might want to apply in a transformer. In particular it aims at making regular 'sklearn like' models more generic and more 'user friendly'.

Here are a few things the wrapper offers to aikit transformers :

  • automatic conversion of input/output into a given format (useful when chaining models where some of them accept DataFrames and some don't, …)
  • verification of the type and shape of new data
  • shape conversion for models that only accept '1-dimensional' input
  • automatic split and concatenation of results for models that only work one column at a time (See : CountVectorizerWrapper)
  • generation of features_names and usage of those names when the output is a DataFrame
  • delayed creation of the underlying model until fit() is called. This allows customizing hyper-parameters based on the data (Ex : n_components can be a percentage of the number of columns given in input).
  • automatic or manual selection of the columns the transformer is supposed to work on.

Let's take sklearn.feature_extraction.text.CountVectorizer as an example. The transformer has the logic implemented, however it can sometimes be a little difficult to use :

  • if your data has more than one text column you need more than one CountVectorizer and you need to concatenate the results

Indeed CountVectorizer works only on 1-dimensional input (corresponding to a text Series or a list of texts)

  • if your data is relatively small you might want to retrieve a regular pandas DataFrame and not a scipy.sparse matrix, which might not work with your following steps
  • you might want feature_names that not only correspond to the 'word/char' but also tell from which column they come. Example of such a column name : 'text1_BAG_dog'
  • you might want to tell the CountVectorizer to work on specific columns (so that you don't have to take care of manually splitting your data)

As a consequence it also makes the creation of a "sklearn compliant" model easy (ie : a model that works well within the sklearn infrastructure : clone, set_params, hyper-parameter search, …).

Wrapping the model makes the creation of complex pipelines like the ones in GraphPipeline a lot easier.

To sum up the aim of the wrapper is to separate :
  1. the logic of the transformer
  2. the mechanical data transformation, checks, … needed to make the transformer robust and easy to use

Selection of the columns

The transformers present in aikit are able to select the columns they work on via a hyper-parameter called 'columns_to_use'.

For example:

from aikit.transformers import CountVectorizerWrapper
vectorizer = CountVectorizerWrapper(columns_to_use=["text1","text2"])

the preceding vectorizer will encode 'text1' and 'text2' using bag-of-words.

The parameter 'columns_to_use' can be of several types :
  • list of strings : list of columns by name (assuming a DataFrame input)
  • list of integers : list of columns by position (either a numpy array or a DataFrame)
  • special string "all" : means all the columns are used
  • DataTypes.CAT : use aikit detection of column types to keep only categorical columns
  • DataTypes.TEXT : use aikit detection of column types to keep only textual columns
  • DataTypes.NUM : use aikit detection of column types to keep only numerical columns
  • other strings like 'object' : use pandas select_dtypes to filter columns based on their dtype

Remark : when a list of strings is used, the 'regex_match' attribute can be set to True. In that case the 'columns_to_use' are regexes and the columns retrieved are the ones that match one of the regexes.

Here are a few examples:

Encoding only one or two columns by name:

from aikit.transformers import CountVectorizerWrapper
vectorizer = CountVectorizerWrapper(columns_to_use=["text1","text2"])

Encoding only ‘TEXT’ columns:

from aikit.transformers import CountVectorizerWrapper
from aikit.enums import DataTypes
vectorizer = CountVectorizerWrapper(columns_to_use=DataTypes.TEXT)

Encoding only ‘object’ columns:

from aikit.transformers import CountVectorizerWrapper
from aikit.enums import DataTypes
vectorizer = CountVectorizerWrapper(columns_to_use="object")
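The wrapped transformers also generate feature names with the column prefix described above (e.g. 'text1_BAG_dog'). A minimal sketch, assuming a DataFrame with a 'text1' column and the get_feature_names method mentioned in the wrapper notes below:

import pandas as pd
from aikit.transformers import CountVectorizerWrapper

df = pd.DataFrame({"text1": ["dog cat", "dog", "cat fish"]})

vectorizer = CountVectorizerWrapper(columns_to_use=["text1"])
Xenc = vectorizer.fit_transform(df)

# features are named '<column>_BAG_<token>', e.g. 'text1_BAG_dog'
print(vectorizer.get_feature_names())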

Drop Unused Columns

aikit transformers can also decide what to do with the columns you didn't encode. By default most transformers drop those columns. That way, at the end of the transformer, you retrieve only the encoded columns.

The behavior is set by the parameter 'drop_unused_columns':
  • True : means you only have the encoded 'columns_to_use' result at the end
  • False : means you have the encoded 'columns_to_use' + the other columns (un-touched by the transformer)

This can make it easier to transform only part of the data.

Remark : the only transformers that have 'drop_unused_columns = False' as default are the categorical encoders. That way they automatically encode the categorical columns but keep the numerical columns un-touched, which means you can plug them at the beginning of your pipeline.

Drop Used Columns

You can also decide if you want to keep the 'columns_to_use' in their original format (pre-encoding). To do that you need to specify 'drop_used_columns=False'. If you do that you'll have both the encoded and the non-encoded values after the transformer. This can be useful sometimes.

For example, let's say that you want to do an SVD but you also want to keep the original columns (so the SVD is not reducing the dimension but adding new composite features). You can do it like that:

from aikit.transformers import TruncatedSVDWrapper
svd = TruncatedSVDWrapper(columns_to_use="all", n_components=5, drop_used_columns=False)
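A small usage sketch (the shape of the result assumes the default DataFrame output of the wrapper):

import numpy as np
import pandas as pd

X = pd.DataFrame(np.random.randn(100, 10), columns=["col%d" % i for i in range(10)])

Xenc = svd.fit_transform(X)
# Xenc should contain the 10 original columns plus the 5 'SVD' components
print(Xenc.shape)   # expected : (100, 15)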

You can wrap your own model

You can use the aikit wrapper for your own models. This is useful if you want to code a new transformer without having to think about all the details needed to make it robust.

See :

How to Wrap a transformer

To wrap a new model you should :
  1. Create a new class that inherits from ModelWrapper
  2. In the __init__ of that class, specify the rules of the wrapper (see just after)
  3. Create a _get_model method to specify the underlying transformer
A few notes:
  • must_transform_to_get_features_name and dont_change_columns are here to help the wrapped transformer implement a correct 'get_feature_names'
  • the wrapped model has a 'model' attribute that retrieves the underlying transformer(s)
  • the wrapped model will raise a NotFittedError when called without being fitted first (a behavior that is not consistent across all sklearn transformers)

Here is an example of how to wrap sklearn CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

from aikit.enums import DataTypes
from aikit.transformers.model_wrapper import ModelWrapper


class CountVectorizerWrapper(ModelWrapper):
    """ wrapper around sklearn CountVectorizer with additionnal capabilities

     * can select its columns to keep/drop
     * work on more than one columns
     * can return a DataFrame
     * can add a prefix to the name of columns

    """
    def __init__(self,
                 columns_to_use = "all",
                 analyzer = "word",
                 max_df = 1.0,
                 min_df = 1,
                 ngram_range = 1,
                 max_features = None,
                 regex_match    = False,
                 desired_output_type = DataTypes.SparseArray
                 ):

        self.analyzer = analyzer
        self.max_df = max_df
        self.min_df = min_df
        self.ngram_range = ngram_range
        self.max_features = max_features
        self.columns_to_use = columns_to_use
        self.regex_match    = regex_match
        self.desired_output_type = desired_output_type

        super(CountVectorizerWrapper,self).__init__(
            columns_to_use = columns_to_use,
            regex_match    = regex_match,

            work_on_one_column_only = True,
            all_columns_at_once = False,
            accepted_input_types = (DataTypes.DataFrame,DataTypes.NumpyArray),
            column_prefix = "BAG",
            desired_output_type = desired_output_type,
            must_transform_to_get_features_name = False,
            dont_change_columns = False)


    def _get_model(self,X,y = None):

        if not isinstance(self.ngram_range,(tuple,list)):
            ngram_range = (1,self.ngram_range)
        else:
            ngram_range = self.ngram_range

        ngram_range = tuple(ngram_range)

        return CountVectorizer(analyzer = self.analyzer,
                               max_df = self.max_df,
                               min_df = self.min_df,
                               max_features = self.max_features,
                               ngram_range = ngram_range)

And here is an example of how to wrap TruncatedSVD to make it work with DataFrames and create feature names:

from sklearn.decomposition import TruncatedSVD


class TruncatedSVDWrapper(ModelWrapper):
    """ wrapper around sklearn TruncatedSVD

    * can select the columns to keep/drop
    * can work on more than one column
    * can return a DataFrame
    * can add a prefix to the name of the columns

    n_components can be a float, if that is the case it is considered to be a percentage of the total number of columns

    """
    def __init__(self,
                 n_components = 2,
                 columns_to_use = "all",
                 regex_match  = False
                 ):
        self.n_components = n_components
        self.columns_to_use = columns_to_use
        self.regex_match    = regex_match

        super(TruncatedSVDWrapper,self).__init__(
            columns_to_use = columns_to_use,
            regex_match    = regex_match,

            work_on_one_column_only = False,
            all_columns_at_once = True,
            accepted_input_types = None,
            column_prefix = "SVD",
            desired_output_type = DataTypes.DataFrame,
            must_transform_to_get_features_name = True,
            dont_change_columns = False)


    def _get_model(self,X,y = None):

        # _nbcols and int_n_components are aikit internal helpers :
        # number of columns of X, and resolution of a (possibly float) n_components into an integer
        nbcolumns = _nbcols(X)
        n_components = int_n_components(nbcolumns, self.n_components)

        return TruncatedSVD(n_components = n_components)

What happens during the fit

To help understand a little more what goes on, here is a brief summary of the fit method :

  1. if 'columns_to_use' is set, creation and fit of an aikit.transformers.model_wrapper.ColumnsSelector to subset the columns
  2. type and shape of input are stored
  3. input is converted if it is not among the list of accepted input types
  4. input is converted to be 1 or 2 dimensions (also depending on what is accepted by the underlying transformer)
  5. underlying transformer is created (using ‘_get_model’) and fitted
  6. logic is applied to try to figure out the features names

Other Transformers

For a full Description of aikit Transformers go there :

All Transformers

Here is a list of some of aikit transformers

Text Transformer
TextDigitAnonymizer
TextNltkProcessing

This is another text pre-processing transformer that does classical text transformations.

CountVectorizerWrapper

Wrapper around sklearn CountVectorizer.

Word2VecVectorizer

This model does a continuous bag-of-words encoding: it fits a Word2Vec model and then averages the word vectors of each text.

Char2VecVectorizer

This model is the equivalent of a "Bag of N-gram" of characters but using embeddings. It fits embeddings for sequences of characters and then averages all those embeddings.

Dimension Reduction
TruncatedSVDWrapper

Wrapper around sklearn TruncatedSVD.

KMeansTransformer

This transformer does a KMeans clustering and uses the clusters to generate new features (based on the distance to each cluster). Remark : for the 'probability' result_type, since KMeans isn't a probabilistic model, the probability is computed using a heuristic.

Feature Selection
FeaturesSelectorRegressor

This transformer will perform feature selection. Different strategies are available:

  • “default” : uses sklearn default selection, using correlation between target and variable
  • “linear” : uses absolute value of scaled parameters of a linear regression between target and variables
  • “forest” : uses feature_importances of a RandomForest between target and variables
FeaturesSelectorClassifier

Exactly as aikit.transformers.base.FeaturesSelectorRegressor but for classification.

Missing Value Imputation
NumImputer

Numerical value imputer for numerical features.

Categories Encoding
NumericalEncoder

This is a transformer to encode categorical variables into numerical values.

The transformer handles two types of encoding:

  • 'dummy' : dummy encoding (aka one-hot encoding)
  • 'num' : simple numerical encoding where each modality is transformed into a number

(see the usage sketch below)

The transformer also includes other capabilities to simplify the encoding pipeline:

  • merging of modalities with too few observations, to prevent a huge result dimension and overfitting,
  • treating None as a special modality if many None values are present,
  • guessing the columns to encode based on their type, if the columns are not specified
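A minimal usage sketch. The name of the encoding parameter (encoding_type) is an assumption here; check the class signature for the exact name:

import pandas as pd
from aikit.transformers import NumericalEncoder

df = pd.DataFrame({"cat1": ["a", "b", "a", "c"], "num1": [1.0, 2.0, 3.0, 4.0]})

# 'dummy' : one-hot encoding of the categorical column, the numerical column is kept un-touched
encoder = NumericalEncoder(encoding_type="dummy")   # parameter name assumed
Xenc = encoder.fit_transform(df)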
CategoricalEncoder

This is a wrapper around the categorical_encoder package.

TargetEncoderRegressor

This transformer also handles categorical encoding but uses the target to do that. The idea is to encode each modality into the mean of the target on the given modality. To do that correctly special care should be taken to prevent leakage (and overfitting).

The following techniques can be used to limit the issue :

  • use of an inner cross validation loop (so an observation in a given fold will be encoded using the average of the target computed on other folds)
  • noise can be added to encoded result
  • a prior corresponding to the global mean is applied; the more observations in a given modality, the less weight the prior has

The precautions explained above cause the transformer to have a different behavior when doing:

  • fit then transform
  • fit_transform

When doing fit then transform, no noise is added during the transformation and the fit saves the global averages of the target. This is what you'd typically want to do when fitting on a training set and then applying the transformation on a testing set.

When doing fit_transform, noise can be added to the result (if noise_level != 0) and the target aggregates are computed fold by fold.

To understand better, here is what happens when fit is called :

  1. variables to encode are guessed (if not specified)
  2. the average of the target per modality is computed
  3. the global average (over the whole dataset) is computed (to use as prior)
  4. the global standard deviation of the target is computed (used to set the noise level)
  5. for each variable and each modality, the encoded value is computed using the global aggregate and the modality aggregate (weighted by a function of the number of observations for that modality, see the sketch below)

Now here is what happens when transform is called :

  1. for each variable and each modality, retrieve the corresponding value and use that numerical feature

Now when doing a fit_transform :

  1. call fit to save everything needed to later be able to transform unseen data
  2. do a cross validation and, for each fold, compute the aggregates on the remaining folds
  3. use those values to encode the modalities
  4. add noise to the result, proportional to noise_level * global standard deviation
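To illustrate the prior weighting, here is a small numerical sketch of one possible encoding formula; the actual weighting function used by aikit may differ:

import pandas as pd

y = pd.Series([1, 0, 1, 1, 0, 1])
modality = pd.Series(["a", "a", "a", "b", "b", "c"])

global_mean = y.mean()     # prior
k = 2.0                    # prior strength (illustrative value)

stats = y.groupby(modality).agg(["mean", "count"])

# weighted average between the modality mean and the global mean :
# the fewer observations in a modality, the closer its encoding is to the prior
encoded = (stats["count"] * stats["mean"] + k * global_mean) / (stats["count"] + k)
print(encoded)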
TargetEncoderClassifier

This transformer handles categorical encoding and uses the target value to do that. It is the same idea as TargetEncoderRegressor but for classification problems. Instead of computing the average of the target, the probability of each target class is used.

The same techniques are used to prevent leakage.

Other Target Encoder

Any new target encoder can easily be created using the same technique. The new target encoder class must inherit from _AbstractTargetEncoder; the aggregating_function can then be overloaded to compute the needed aggregate.

The _get_output_column_name can also be overloaded to specify feature names.

Scaling
CdfScaler

This transformer is used to re-scale features; the re-scaling is non-linear. The idea is to fit a cdf for each feature and use it to re-scale the feature to follow either a uniform or a gaussian distribution.

Target Transformation
BoxCoxTargetTransformer

This transformer is a regression model that modifies the target by applying a boxcox transformation to it. The target can be positive or negative. This transformation is useful to flatten the distribution of the target, which can help the underlying model (especially the ones that are not robust to outliers).

Remark : It is important to note that when predicting, the inverse transformation will be applied. If what matters to you is the error on the logarithm of the target, you should:

  • directly transform your target beforehand
  • use a customized scorer

Example of transformation using ll = 0:

boxcox transformation with ``ll = 0``

When ll increases the flattening effect diminishes :

boxcox transformations family

Other Functionnalities

Aikit offers other functionalities that might be useful for the DataScientist job.

Model Json

Functions to go back and forth between a json-like object and a sklearn-like model that can be fitted. This can be used to store the definition of a complete model into a json file for your projects.

ModelJson

Model representation

It is sometimes useful to specify the model to use for a given use-case outside of the main code of the project, for example in a json-like object. This can have several advantages :

  • it allows changing the underlying model without changing any code (example : shift from a RandomForestClassifier to a LGBMClassifier)
  • it allows the same code to be used for different sub-problems while allowing specific hyper-parameters/models for each sub-problem
  • it is easier to incorporate models that were found automatically by an ml_machine

To be able to do that we need to save the description of a complex model into a simple json like format.

The syntax is easy : a model is represented by a tuple with its name and its hyper-parameters.

Example, the model:

RandomForestClassifier(n_estimators=100)

is represented by the object:

("RandomForestClassifier",{"n_estimators":100})

So : klass(**kwargs) is equivalent to (‘klass’,kwargs)

Let's take a more complex example using a GraphPipeline:

gpipeline = GraphPipeline(models = {"vect" : CountVectorizerWrapper(analyzer="char",ngram_range=(1,4)),
                                        "svd"  : TruncatedSVDWrapper(n_components=400) ,
                                        "logit" : LogisticRegression(class_weight="balanced")},
                               edges = [("vect","svd","logit")]
                               )

is represented by:

json_object = ("GraphPipeline", {"models": {"vect" : ("CountVectorizerWrapper"  , {"analyzer":"char","ngram_range":(1,4)} ),
                             "svd"  : ("TruncatedSVDWrapper"     , {"n_components":400}) ,
                             "logit": ("LogisticRegression" , {"class_weight":"balanced"}) },
                  "edges":[("vect","svd","logit")]
                  })

So if a given model uses other models as parameters it works as well.

Model conversion

Once the object is created you can convert it to a real (unfitted) model using aikit.model_definition.sklearn_model_from_param()

sklearn_model_from_param(json_object)

which gives a model that can be fitted.

Json saving

That representation uses only simple types that are json serializable (string, number, list, dictionary) and the json can be saved on disk.

Remark : since json doesn't allow :
  • tuples (only lists are known)
  • dictionaries with non-string keys

it is best to override the json serializer to handle those types. The special encoder is found in aikit.tools.json_helper, and 'save_json' and 'load_json' can be used directly.

Example saving the ‘json_object’ above:

from aikit.tools.json_helper import save_json, load_json

save_json(json_object, fname="model.json")
reloaded_json_object = load_json("model.json")

The special serializer works by transforming types that json can't handle into a dictionary with :
  • a '__items__' key with the list of objects
  • a ‘__type__’ key with the original type

Example:

("a","b")

is transformed into:

{"__items__":["a","b"], "__type__":"__tuple__"}
The handle types are :
  • dict : ‘__dict__’
  • tuple

Model Register

To be able to use a given model using only its name, all the models should be registered in a dictionary.

This is done within aikit.simple_model_registration; in that file you have a DICO_NAME_KLASS object which stores the classes of every model. To add a new model, simply use the add_klass method.

Example:

DICO_NAME_KLASS.add_klass(LGBMClassifier)
DICO_NAME_KLASS.add_klass(LGBMRegressor)

Remark : this register is different from the one used for the automatic machine learning part (ml_machine), which contains more information (hyper-parameters, type, …)

Stacking

Tools to do stacking easily without having to re-code everything from scratch.

Model Stacking

Here you’ll find the explanation about how to fit stacking models using aikit.

You can also see the notebook :

Stacking How To

Stacking (or Blending) is done when you want to aggregate the results of several models into an aggregated prediction. More precisely you want to use the predictions of the models as input to another blending model. (In what follows I'll call models the models that we want to use as features and blender the model that uses those as features to make the final prediction.)

To prevent overfitting you typically want to use out-sample predictions to fit the blender (otherwise the blender will just learn to trust the model that overfits the most…).

To generate out-sample predictions you can do a cross-validation:

for fold (i) and a given model :

  1. fit model on all the data from the other folds (1,…,i-1,i+1, .. N)
  2. call fitted model on data from fold (i)
  3. store the predictions

If that is done on all the folds, you get a prediction for every sample in your database, but each prediction was generated by a model that did not see that sample during training.

Now you can take all those predictions and fit a blender model.
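Conceptually, this is what a manual stacking would look like with plain scikit-learn tools (a sketch only; aikit's StackerClassifier described below does this for you):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=123)

models = [RandomForestClassifier(random_state=123), LogisticRegression()]

# out-of-fold probabilities of each model, used as features for the blender
oof_predictions = np.hstack([
    cross_val_predict(model, X, y, cv=10, method="predict_proba")
    for model in models
])

blender = LogisticRegression().fit(oof_predictions, y)

# for predictions on new data, the base models are re-fitted on the full training set
for model in models:
    model.fit(X, y)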

StackerClassifier

This class does exactly what is described above. It takes as input a list of un-fitted models as well as a blending model and will generate a stacking model. In that case what is given to the blending model are the probabilities of each class for each model.

StackerRegressor

Same idea but for regression problems.

OutSamplerTransformer

The library offers another way to create stacking model. A stacking model can be viewed as another kind of pipeline : instead of a transformation it just uses other models as a special kind of transformers. And so GraphPipeline can be used to chain this transformation with a blending model.

To be able to do that two things are needed:
  1. a model is not a transformer, and so we need to transform a model into a transformer (basically saying that the transform method should use predict or predict_proba)
  2. to do correct stacking we need to generate out-sample predictions

The OutSamplerTransformer does just that. It takes care of generating out-sample predictions, and the blending can be done in another node of the GraphPipeline (see example below).

Example

Let’s say we want to stack a RandomForestClassifier, an LGBMClassifier and a LogisticRegression. You can create the model like that:

stacker = StackerClassifier( models = [RandomForestClassifier() , LGBMClassifier(), LogisticRegression()],
                             cv = 10,
                             blender = LogisticRegression()
                            )

and then fit it as you would a regular model:

stacker.fit(X, y)

Using OutSamplerTransformer we can achieve the same thing but with a Pipeline instead:

from sklearn.model_selection import StratifiedKFold
from aikit.models import OutSamplerTransformer
from aikit.pipeline import GraphPipeline
cv = StratifiedKFold(10, shuffle=True, random_state=123)

stacker = GraphPipeline(models = {
    "rf"   : OutSamplerTransformer(RandomForestClassifier() , cv = cv),
    "lgbm" : OutSamplerTransformer(LGBMClassifier()         ,  cv = cv),
    "logit": OutSamplerTransformer(LogisticRegression()     , cv = cv),
    "blender":LogisticRegression()
    }, edges = [("rf","blender"),("lgbm","blender"),("logit","blender")])
stacking with a pipeline
Remark:
  1. the 2 models are equivalent
  2. to have regular stacking the same cv should be used everywhere (either by creating it beforehand, by setting the random state, or by using a non-shuffled cv)
With this idea we can do more complicated things like :
  1. deep stacking with more than one layer : simply add other layers to the GraphPipeline
  2. creating a blender that uses both the predictions of the models and the features (or part of them) : simply add another node linked to the blender (a PassThrough node for example)
  3. doing pre-processing before any stacking (and so doing it outside of the cv loop)

For example:

stacker = GraphPipeline(models = {
    "rf"   : OutSamplerTransformer(RandomForestClassifier() , cv = cv),
    "lgbm" : OutSamplerTransformer(LGBMClassifier() , cv = cv),
    "logit": OutSamplerTransformer(LogisticRegression(), cv = cv),
    "pass" : PassThrough(),
    "blender":LogisticRegression()
    }, edges = [("rf","blender"),
                ("lgbm","blender"),
                ("logit","blender"),
                ("pass", "blender")
                ])
stacking with a pipeline (with features)

Or:

stacker = GraphPipeline(models = {
    "enc"  : NumericalEncoder(),
    "imp"  : NumImputer(),
    "rf"   : OutSamplerTransformer(RandomForestClassifier() , cv = cv),
    "lgbm" : OutSamplerTransformer(LGBMClassifier() , cv = cv),
    "logit": OutSamplerTransformer(LogisticRegression(), cv = cv),
    "blender":LogisticRegression()
    }, edges = [("enc","imp"),
                ("imp","rf","blender"),
                ("imp","lgbm","blender"),
                ("imp","logit","blender")
                ])
stacking with a pipeline (with preprocessing)

And lastly:

stacker = GraphPipeline(models = {
    "enc"  : NumericalEncoder(columns_to_use= ["cat1","cat2","num1","num2"]),
    "imp"  : NumImputer(),
    "cv"   : CountVectorizerWrapper(columns_to_use = ["text1","text2"]),
    "logit": OutSamplerTransformer(LogisticRegression(), cv = cv),
    "lgbm" : OutSamplerTransformer(LGBMClassifier() , cv = cv),
    "blender":LogisticRegression()
    }, edges = [("enc","imp","lgbm","blender"),
                ("cv","logit","blender")
                ])
stacking with a pipeline
This last example shows a model where you have:
  • categorical and numerical data, on which you apply a classical category encoder, fill missing values and use gradient boosting
  • textual data, which you encode using a CountVectorizer and a LogisticRegression

Both models can be mixed with a LogisticRegression blender. Doing that essentially creates an average of the predictions, the only difference being that since the blender is fitted, the weights are in a sense optimal. (In some cases it might work better than just concatenating everything, especially since the 2 sets of features are very different.)

Another thing that can be done is to calibrate probabilities. This can be useful if your model generates meaningful scores but you can't directly interpret those scores as probabilities:
  • if you skewed your training set to solve class imbalance
  • if your model is not probabilistic

One method to re-calibrate probabilities, called Platt's scaling, is to fit a LogisticRegression on the output of your model. If your predictions are not completely wrong, this will usually just compute an increasing function that recalibrates your probabilities but won't change the ordering of the output (roc_auc won't be affected, but logloss or accuracy can change).

This can also be done using OutSamplerTransformer:

rf_rescaled = GraphPipeline(models = {
    "enc"  : NumericalEncoder(),
    "imp"  : NumImputer(),
    "rf"   : OutSamplerTransformer( RandomForestClassifier(class_weight = "auto"), cv = 10),
    "scaling":LogisticRegression()
    }, edges = [('enc','imp','rf','scaling')]
)
scaling with a pipeline

OutSamplerTransformer regression mode

You can do exactly the same thing for regression tasks; the only difference is that the cross-validation uses predict() instead of predict_proba(), as in the sketch below.
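A sketch of an equivalent stacking pipeline for a regression problem (using sklearn regressors and a linear blender; the model choices here are only illustrative):

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

from aikit.models import OutSamplerTransformer
from aikit.pipeline import GraphPipeline

cv = KFold(10, shuffle=True, random_state=123)

stacker_reg = GraphPipeline(models = {
    "rf"     : OutSamplerTransformer(RandomForestRegressor(), cv = cv),
    "ridge"  : OutSamplerTransformer(Ridge(), cv = cv),
    "blender": Ridge()
    }, edges = [("rf", "blender"), ("ridge", "blender")])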

Data Structure Helper

You also have tools to help you with things like :
  • conversion of datatypes
  • concatenation of heterogeneous datatypes
  • guessing the type of columns

See :

Data Structure Helper

Data Types

Aikit helps dealing with the multiple types of data that coexist within the scikit-learn, pandas, numpy and scipy environments. Mainly :

  • pandas DataFrame
  • pandas Sparse DataFrame
  • numpy array
  • scipy sparse array (csc, csr, coo, …)

The library offers tools to easily convert between each type.

Within aikit.enums there is a DataType enumeration with the following values :

  • ‘DataFrame’
  • ‘SparseDataFrame’
  • ‘Series’
  • ‘NumpyArray’
  • ‘SparseArray’

This is best used as an enumeration, but the values are actual strings so you can use the string directly if needed.

The function aikit.tools.data_structure_helper.get_type retrieves the type (one element of the enumeration).

Example of use:

from aikit.tools.data_structure_helper import get_type, DataTypes
df = pd.DataFrame({"a":[0,1,2],"b":["aaa","bbb","ccc"]})
mapped_type = get_type(df)

if mapped_type == DataTypes.DataFrame:
    first_column = df.loc[:,"a"]
else:
    first_column = df[:,0]

Generic Conversion

You can also convert each type to the desired type. This can be useful if a transformer only accepts DataFrames, or doesn’t work with a Sparse Array, … For that use the function aikit.tools.data_structure_helper.convert_generic

Example:

from aikit.tools.data_structure_helper import convert_generic, DataTypes
df = pd.DataFrame({"a":[0,1,2],"b":["aaa","bbb","ccc"]})

arr = convert_generic(df, output_type = DataTypes.NumpyArray)

Generic Horizontal Stacking

You can also horizontally concatenate multiple pieces of data (assuming they have the same number of rows). You can either specify the output type you want, or let the algorithm guess it :

  • if everything has the same type, that type is used
  • if there are DataFrames and arrays, a DataFrame is used
  • if there are sparse and non-sparse data : convert to dense if not too big, otherwise keep sparse

(See aikit.tools.data_structure_helper.guess_output_type)

The function to concatenate everything is aikit.tools.data_structure_helper.generic_hstack

Example:

import pandas as pd
from aikit.tools.data_structure_helper import generic_hstack

df1 = pd.DataFrame({"a":list(range(10)),"b":["aaaa","bbbbb","cccc"]*3 + ["ezzzz"]})
df2 = pd.DataFrame({"c":list(range(10)),"d":["aaaa","bbbbb","cccc"]*3 + ["ezzzz"]})

df12 = generic_hstack((df1,df2))

Other

Two other functions that can be useful are aikit.tools.data_structure_helper.make1dimension and aikit.tools.data_structure_helper.make2dimensions. They convert to a 1-dimensional or 2-dimensional object whenever possible (see the sketch below).
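A small sketch of the expected behaviour (the exact returned shapes are assumed here for illustration):

import numpy as np
from aikit.tools.data_structure_helper import make1dimension, make2dimensions

arr = np.arange(5)

arr2d = make2dimensions(arr)    # expected shape : (5, 1)
arr1d = make1dimension(arr2d)   # expected shape : (5,)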

Block Manager

A special data storage that is able to contain several pieces of data of different types (but with the same number of observations) in one single indexable object.

Block Selector

In order to facilitate use-cases where several types of data are present (as well as for internal use) two things were created :
  • a BlockSelector transformer whose job is to select a given block of data (see below)
  • a BlockManager object to store and retrieve blocks of data.

Here is the explanation. Let's imagine you have a use case where data has more than one type; for example you have text and regular features, or text and images. Typically, to handle that, you would need to put everything into a DataFrame and then use selectors to retrieve the different parts of the data. This is already greatly facilitated by aikit with the GraphPipeline, the fact that wrapped models can select the variables they work on, and the fact that this selection can be done using regexes (and consequently prefixes). You could also deal with it manually, but then cross-validation and splitting of train and test would also have to be done manually.

However sometimes it is just not practical or possible to merge everything into a DataFrame :
  • you might have a big sparse entry and other regular features : it is not practical to merge everything into either a dense or a sparse object
  • you might have pictures and regular features, and typically the pictures will be stored in a 3 or 4 dimensional tensor (observation x height x width x channels) and not the classical 2-dimensional object
  • … any other reason

What you can do is put ALL your data into a dictionary (or a BlockManager, see just after) : one key per type of data and one value per block of data. If you pass that object to a BlockSelector it can retrieve each block separately:

Xtrain = {"regular_features":pandas_dataframe_object,
          "sparse_features" :scipy_sparse_object
          }

block_selector = BlockSelector("regular_features")
df = block_selector.fit_transform(Xtrain)

Here df is just the pandas_dataframe_object object. Remark : fit_transform is used because this works like a classical transformer, even if the transformer doesn't do anything during the fit.

So you can just put those objects at the top of your pipeline and use a dictionary of data in place of X.

The BlockSelector object also works with a list of data (in that case the block to select is the integer corresponding to its position):

Xtrain = [pandas_dataframe_object,
          scipy_sparse_object]

block_selector = BlockSelector(0)
df = block_selector.fit_transform(Xtrain)

Example of such pipeline:

GraphPipeline(models = {
        "sel_sparse":BlockSelector("sparse_features"),
        "svd":TruncatedSVDWrapper(),
        "sel_other":BlockSelector("regular_features"),
        "rf":RandomForestClassifier()
        },
edges = [("sel_sparse","svd","rf"),("sel_other","rf")])

That model can be fitted; however, you can't cross-validate it easily. That is because Xtrain (which is a dictionary or a list of data) isn't subsettable : you can't do Xtrain[index,:] or Xtrain.loc[index,:].

Block Manager

To solve this a new type of object is needed : the BlockManager. This object is conceptually exactly like the Xtrain object before; it can either be a dictionary of data or a list of data. However it has a few additional things that allow it to work well within the sklearn environment.

  • it has a shape attribute
  • it can be subsetted using ‘iloc’

Example:

df = pd.DataFrame({"a":np.arange(10),"b":["aaa","bbb","ccc"] * 3 + ["ddd"]})
arr = np.random.randn(df.shape[0],5)
X = BlockManager({"df":df, "arr":arr})

X["df"]        # this retrieves the DataFrame df (no copy)
X["arr"]       # this retrieves the numpy array arr (no copy)
X.iloc[0:5,:]  # this create a smaller BlockManager object with only 5 observations

block_selector = BlockSelector("arr")
block_selector.fit_transform()         #This retrieve the numpy array "arr" as well

How to add new models

One of the ideas of the package is to offer ways to quickly test ideas without much burden. It is thus relatively easy to add new models/transformers to the framework.

Those models/transformers can be included in the search of the auto-ml component to be tested on different databases and with other transformers/models.

A model needs to be added in two different places in order to be fully integrated within the framework.

Let's see what needs to be done to include a hypothetical new model:

class ReallyCoolNewTransformer(BaseEstimator, TransformerMixin):
    """ This is a great new  transformer """
    def __init__(self, super_choice):
        self.super_choice = super_choice

    def fit(self, X, y = None):
        # ... fitting logic goes here ...
        return self

    def transform(self, X):
        # ... transformation logic goes here ...
        pass

Add model to Simple register

This will allow the function ‘sklearn_model_from_param’ to be able to use your new model. The class simply needs to be added to the DICO_NAME_KLASS object:

from aikit.model_definition import DICO_NAME_KLASS
DICO_NAME_KLASS.add_klass(ReallyCoolNewTransformer)

Now that this is done, you can call the transformer by its name:

from aikit.model_definition import sklearn_model_from_param
model = sklearn_model_from_param(("ReallyCoolNewTransformer",{}))

model is an instance of ReallyCoolNewTransformer

Add model to Auto-Ml framework

This is a little more complicated; a few more pieces of information need to be entered:

  • type of model
  • type of variable it uses
  • hyper-parameters

To do that you need to use the @register decorator:

from aikit.ml_machine.ml_machine_registration import register, _AbstractModelRepresentationDefault, StepCategories
import aikit.ml_machine.hyper_parameters as hp

@register
class DimensionReduction_ReallyCoolNewTransformer(_AbstractModelRepresentationDefault):
    klass = ReallyCoolNewTransformer

    category = StepCategories.DimensionReduction
    type_of_variable = None
    type_of_model = None # Used for all problem

    custom_hyper = {"super_parameters":hp.HyperChoice(("superchoice_a","superchoice_b"))}

See Model Register for a complete description of the register. See Hyperparameters for a complete description of the hyper-parameters.

Remark: The registers behave like singletons, so you can modify them in any part of the code. You just need the code to be executed somewhere for it to work.

If a model is stable and tested enough the new entry can be added to the python files :

  • ‘model_definition.py’ : for the simple register
  • ml_machine/ml_machine_registration.py : for the full auto-ml register

(See Contribution for details about how to contribute to the evolution of the library)

Remark : you don’t need to use the wrapper for your model to be incorporated in the framework. However, it is best to do so. That way you can focus on the logic and let the wrapper make your model more generic.

Model Register

To be able to randomly create complex processing pipelines the Ml Machine needs to know a few things.

  1. The list of models/transformers to use
  2. Information about each model/transformation
For each model/transformer, here is the information needed:
  1. the step on which it will be used : CategorieEncoder, TextProcessing, …
  2. the type of variable it will be used on : Text, Numerical, Categorie or None (for everything)
  3. can it be used for regression, classification or both
  4. the list of hyper-parameters and their values
  5. … any other needed piece of information

To be able to do that, each transformer and model should be registered, and everything is stored into an object.

All that is done within aikit.ml_machine.ml_machine_registration

To register a new model simply create a class and decorate it:

@register
class RandomForestClassifier_Model(_AbstractModelRepresentationDefault):

    klass = RandomForestClassifier
    category = StepCategories.Model
    type_of_variable = None

    custom_hyper = {"criterion": ("gini","entropy") }

    is_regression = False


    default_parameters = {"n_estimators":100}

Remark : the name of the class doesn’t matter, no instance will ever be created. It is just a nice way to write information.

A few things are needed within that class:

  • klass : should contain the actual class of the underlying model
  • category : one of the StepCategories choices
  • type_of_variable : if None it means that the model should be applied to the complete Dataset (this field will be used to create branches for the different types of variables in the pipeline)
  • is_regression : True, False or None (None means both)
  • default_parameters : to override the klass default parameters. Those parameters are used during the First Round of the Ml Machine

HyperParameters

Each class should be able to generate its hyper-parameters, that is done by default with the ‘get_hyper_parameter’ class method.

Here is what is done to generate the hyper-parameters:

For each parameter within the signature of the __init__ method of the klass:
  • if the parameter is present within 'custom_hyper', use that to generate the corresponding values
  • if it is present within the 'default_hyper', use that to generate the corresponding values
  • if not, don't include it in the hyper-parameters (and consequently the default value will be kept).

Default Hyper

Here is the list of the default hyper-parameters:

default_hyper = {
    "n_components" : hp.HyperRangeFloat(start = 0.1,end = 1,step = 0.05),

    # Forest like estimators
    "n_estimators" : hp.HyperComposition(
                        [(0.75,hp.HyperRangeInt(start = 25, end = 175, step = 25)),
                         (0.25,hp.HyperRangeInt(start = 200, end = 1000, step = 100))]),

    "max_features":hp.HyperComposition([ ( 0.25 , ["sqrt","auto"]),
                                         ( 0.75 , hp.HyperRangeBetaFloat(start = 0, end = 1,alpha = 3,beta = 1) )
                                       ]),

    "max_depth":hp.HyperChoice([None,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,20,25,30,50,100]),
    "min_samples_split":hp.HyperRangeBetaInt(start = 1,end = 100, alpha = 1, beta = 5),

    # Linear model
    "C":hp.HyperLogRangeFloat(start = 0.00001, end = 10, n = 50),

    "alpha":hp.HyperLogRangeFloat(start = 0.00001,end = 10, n = 50),

    # CV
    "analyzer":hp.HyperChoice(['word','char','char_wb']),
 }

This helps when hyper-parameters are common across many models.

Special HyperParameters

In some cases, hyper-parameters can’t be drawn independently from each other.

For example, by default, for a CountVectorizer we might want to test either:
  • 'char' encoding with bags of chars of size 1, 2, 3 or 4
  • 'word' encoding with bags of words of size 1 only

In that case we need to create a custom global hyper-parameter, that can be done by overriding the get_hyper_parameter() classmethod.

Example:

@register
class CountVectorizer_TextEncoder(_AbstractModelRepresentationDefault):
    klass = CountVectorizerWrapper
    category = StepCategories.TextEncoder
    type_of_variable = TypeOfVariables.TEXT


    @classmethod
    def get_hyper_parameter(cls):
        ### Specific function to handle the fact that I don't want ngram != 1 IF analyzer = word ###
        res = hp.HyperComposition([(0.5 , hp.HyperCrossProduct({"ngram_range":1,
                                                               "analyzer":"word",
                                                               "min_df":[1,0.001, 0.01, 0.05],
                                                               "max_df":[0.999, 0.99, 0.95]
                                                               }) ),
                                  (0.5  , hp.HyperCrossProduct({
                                              "ngram_range": hp.HyperRangeInt(start = 1,end = 4),
                                               "analyzer": hp.HyperChoice(("char","char_wb")) ,
                                               "min_df":[1, 0.001, 0.01, 0.05],
                                               "max_df":[0.999, 0.99, 0.95]
                                               }) )
                                    ])



        return res

This says that the global hyper-parameter is a composition between a bag of words with ngram_range = 1 and a bag of chars with ngram_range between 1 and 4.

(See Hyperparameters for detailed explanation on how to specify hyper-parameters)

Hyperparameters

Here you’ll find the explanation of how to specify random hyper-parameters. Those hyper-parameters are used to generate random values of the parameters of a given model.

For example, to generate a random integer between 1 and 10 (inclusive) you can use HyperRangeInt:

hyper = HyperRangeInt(start = 1, end = 10)
[hyper.get_rand() for _ in range(5)]

>> [4,1,10,3,3]

The complete list of hyper-parameter classes is available in aikit.ml_machine.hyper_parameters. Each class implements a 'get_rand' method.

HyperCrossProduct

This class is used to combine hyper-parameters together, which is needed to generate complex dictionary-like hyper-parameters.

Example:

hp = HyperCrossProduct({"int_value":HyperRangeInt(0,10), "float_value":HyperRangeFloat(0,1)})
hp.get_rand()

This will generate a random dictionary with keys 'int_value' and 'float_value'.

HyperComposition

This class is used to include dependencies between parameters, or to create a hyper-parameter from two different distributions.

Example:

hp = HyperComposition([ (0.9,HyperRangeInt(0,100)) ,(0.1,HyperRangeInt(100,1000)) ])
hp.get_rand()

This will generate a random number between 0 and 100 with probability 0.9 and one between 100 and 1000 with probability 0.1

hp = HyperComposition([
    (0.5 , HyperCrossProduct({"analyzer":"char" , "n_gram_range":[1,4]})),
    (0.5 , HyperCrossProduct({"analyzer":"word" , "n_gram_range":1}) )
])
hp.get_rand()

This will generate with probability:

  • 1/2 a dictionary with "analyzer":"char" and "n_gram_range": random between 1 and 4
  • 1/2 a dictionary with "analyzer":"word" and "n_gram_range": 1

Contribution

You can contribute to aikit's source code. To do that you need to do the following steps :

  1. fork aikit
  2. clone the fork
  3. create a branch with your new development
  4. push the branch and create a pull request
