ugtm: Generative Topographic Mapping with Python

Overview

Generative topographic mapping (GTM) is a probabilistic dimensionality reduction algorithm introduced by Bishop, Svensen and Williams, which can also be used for classification and regression using class maps or activity landscapes.

ugtm v2.0 provides a sklearn-compatible GTM transformer (eGTM), GTM classifier (eGTC) and GTM regressor (eGTR):

from ugtm import eGTM, eGTC, eGTR
import numpy as np

# Dummy train and test
X_train = np.random.randn(100, 50)
X_test = np.random.randn(50, 50)
y_train = np.random.choice([1, 2, 3], size=100)

# GTM transformer
transformed = eGTM().fit(X_train).transform(X_test)

# Predict new labels using GTM classifier (GTC)
predicted_labels = eGTC().fit(X_train, y_train).predict(X_test)

# Predict new continuous outcomes using GTM regressor (GTR)
predicted_outcomes = eGTR().fit(X_train, y_train).predict(X_test)

Installation

Prerequisites

ugtm requires Python 2.7 or later (tested on Python 3.4.6 and Python 2.7.14), with the following packages:

  • scikit-learn>=0.20
  • numpy>=1.14.5
  • matplotlib>=2.2.2
  • scipy>=0.19.1
  • mpld3>=0.3
  • jinja2>=2.10

pip installation

Install using pip from the command line:

pip install ugtm

If this does not work, try upgrading pip and the required packages first:

sudo pip install --upgrade pip numpy scikit-learn matplotlib scipy mpld3 jinja2

Using anaconda

Example of an anaconda virtual env "p2" for Python 2.7.14:

conda create -n p2 python=2.7.14 numpy=1.14.5 \
scikit-learn=0.20 matplotlib=2.2.2 \
scipy=0.19.1 mpld3=0.3 jinja2=2.10

# Activate virtual env
source activate p2

# Install package
pip install ugtm

Example of an anaconda virtual env "p3" for Python 3.6.6:

conda create -n p3 python=3.6.6 numpy=1.14.5 \
scikit-learn=0.20 matplotlib=2.2.2 \
scipy=0.19.1 mpld3=0.3 jinja2=2.10

# Activate virtual env
source activate p3

# Install package
pip install ugtm

Import package

In a Python console, import the ugtm package:

import ugtm

eGTM: GTM transformer

Run GTM

eGTM is a sklearn-compatible GTM transformer. Similarly to PCA or t-SNE, eGTM reduces the dimensionality from n_dimensions to 2 dimensions. To generate mean GTM 2D projections:

from ugtm import eGTM
import numpy as np

X_train = np.random.randn(100, 50)
X_test = np.random.randn(50, 50)

# Fit GTM on X_train and get 2D projections for X_test
transformed = eGTM().fit(X_train).transform(X_test)

The default output of eGTM.transform is the mean GTM projection. For other data representations (modes, responsibilities), see transform().
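For instance, the means and modes representations can be selected explicitly through the model keyword, as in the visualization examples later in this document:

gtm = eGTM().fit(X_train)

# Mean positions (default): responsibility-weighted average over all map nodes
gtm_means = gtm.transform(X_test, model="means")

# Mode positions: coordinates of the node with highest responsibility
gtm_modes = gtm.transform(X_test, model="modes")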

Visualize projection

Visualization demo using altair (https://altair-viz.github.io):

from ugtm import eGTM
import numpy as np
import altair as alt
import pandas as pd

X_train = np.random.randn(100, 50)
X_test = np.random.randn(50, 50)

transformed = eGTM().fit(X_train).transform(X_test)

df = pd.DataFrame(transformed, columns=["x1", "x2"])
alt.Chart(df).mark_point().encode(
    x='x1', y='x2',
    tooltip=["x1", "x2"]
).properties(title="GTM projection of X_test").interactive()

eGTC: GTM classifier

Run eGTC

eGTC is a sklearn-compatible GTM classifier. Similarly to PCA or t-SNE, GTM reduces the dimensionality from n_dimensions to 2 dimensions; GTC then uses a GTM class map to predict labels for new data (cf. classMap()). Two algorithms are available: the Bayesian classifier (uGTC) and the nearest-node classifier (uGTCnn). The following example uses the iris dataset:

from ugtm import eGTC
from sklearn import datasets
from sklearn import preprocessing
from sklearn import metrics
from sklearn import model_selection

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=42)

# optional preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Predict labels for X_test
gtc = eGTC()
gtc = gtc.fit(X_train,y_train)
y_pred = gtc.predict(X_test)

# Print score
print(metrics.matthews_corrcoef(y_test,y_pred))

Visualize class map

The GTC algorithm is based on a classification map, discretized into a grid of nodes which are colored by predicted label. Each node is associated with class probabilities:

from ugtm import eGTM, eGTC
import numpy as np
import altair as alt
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn import model_selection

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=42)

# optional preprocessing
std = preprocessing.StandardScaler()
X_train = std.fit(X_train).transform(X_train)

# Construct class map
gtc = eGTC()
gtc = gtc.fit(X_train,y_train)

dfclassmap = pd.DataFrame(gtc.optimizedModel.matX, columns=["x1", "x2"])
dfclassmap["predicted node label"] = iris.target_names[gtc.node_label]
dfclassmap["probability of predominant class"] = np.max(gtc.node_probabilities,axis=1)

# Classification map
alt.Chart(dfclassmap).mark_square().encode(
    x='x1',
    y='x2',
    color='predicted node label:N',
    size=alt.value(50),
    opacity='probability of predominant class',
    tooltip=['x1','x2', 'predicted node label:N', 'probability of predominant class']
).properties(title = "Class map", width = 200, height = 200)

Visualize predicted vs real labels

Visualize predicted vs real labels using the iris dataset and altair:

from ugtm import eGTM, eGTC
import numpy as np
import altair as alt
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=42)

# optional preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Predict labels for X_test
gtc = eGTC()
gtc = gtc.fit(X_train,y_train)
y_pred = gtc.predict(X_test)

# Get GTM transform for X_test
transformed = eGTM().fit(X_train).transform(X_test)

df = pd.DataFrame(transformed, columns=["x1", "x2"])
df["predicted label"] = iris.target_names[y_pred]
df["true label"] = iris.target_names[y_test]
df["probability of predominant class"] = np.max(gtc.posteriors,axis=1)

# Projection of X_test colored by predicted label
chart1 = alt.Chart().mark_circle().encode(
    x='x1',y='x2',
    size=alt.value(100),
    color=alt.Color("predicted label:N",
           legend=alt.Legend(title="label")),
    opacity="probability of predominant class:Q",
    tooltip=["x1", "x2", "predicted label:N",
             "true label:N", "probability of predominant class"]
).properties(title="Pedicted labels", width=200, height=200).interactive()

# Projection of X_test colored by true label
chart2 = alt.Chart().mark_circle().encode(
    x='x1', y='x2',
    color=alt.Color("true label:N",
                    legend=alt.Legend(title="label")),
    size=alt.value(100),
    tooltip=["x1", "x2", "predicted label:N",
             "true label:N", "probability of predominant class"]
).properties(title="True labels", width=200, height=200).interactive()


alt.hconcat(chart1, chart2, data=df)
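The confusion_matrix function imported above can be used to summarize the agreement between true and predicted labels on the test set:

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))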

Parameter optimization

GridSearchCV can be used with eGTC for parameter optimization:

from ugtm import eGTC
import numpy as np
from sklearn.model_selection import GridSearchCV

# Dummy train and test
X_train = np.random.randn(100, 50)
X_test = np.random.randn(50, 50)
y_train = np.random.choice([1, 2, 3], size=100)

# Parameters to tune
tuned_params = {'regul': [0.0001, 0.001, 0.01],
                's': [0.1, 0.2, 0.3],
                'k': [16],
                'm': [4]}

# GTM classifier (GTC), bayesian
gs = GridSearchCV(eGTC(), tuned_params, cv=3, iid=False, scoring='accuracy')
gs.fit(X_train, y_train)
print(gs.best_params_)
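By default, GridSearchCV refits the estimator with the best parameters on the whole training set, so the fitted search object can be used directly for prediction (a minimal sketch using the dummy data above):

# Best cross-validated accuracy
print(gs.best_score_)

# Predict test labels with the refit best estimator
y_pred = gs.predict(X_test)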

eGTR: GTM regressor

Run eGTR

eGTR is a sklearn-compatible GTM regressor. Similarly to PCA or t-SNE, GTM reduces the dimensionality from n_dimensions to 2 dimensions; GTR then uses a GTM activity landscape to predict continuous outcomes for new data (cf. landscape()). The following example uses the Boston housing dataset:

from ugtm import eGTR
from sklearn import datasets
from sklearn import preprocessing
from sklearn import model_selection

boston = datasets.load_boston()
X = boston.data
y = boston.target

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=42)

# optional preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Predict continuous outcomes for X_test
gtr = eGTR()
gtr = gtr.fit(X_train,y_train)
y_pred = gtr.predict(X_test)
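To evaluate the regressor on the held-out data, the root mean squared error (RMSE) can be computed as in the regression examples later in this document:

from numpy import sqrt
from sklearn.metrics import mean_squared_error

# RMSE of GTR predictions on the test set
print(sqrt(mean_squared_error(y_test, y_pred)))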

Visualize activity landscape

The GTR algorithm is based on an activity landscape, discretized into a grid of nodes which can be colored by predicted activity value. This visualization uses the python package altair:

from ugtm import eGTR, eGTM
import numpy as np
import altair as alt
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn import model_selection

boston = datasets.load_boston()
X = boston.data
y = boston.target

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=42)

# optional preprocessing
std = preprocessing.StandardScaler()
X_train = std.fit(X_train).transform(X_train)

# Construct activity landscape
gtr = eGTR()
gtr = gtr.fit(X_train,y_train)

dfclassmap = pd.DataFrame(gtr.optimizedModel.matX, columns=["x1", "x2"])
dfclassmap["predicted node label"] = gtr.node_label

# Activity landscape
alt.Chart(dfclassmap).mark_square().encode(
    x='x1',
    y='x2',
    color=alt.Color('predicted node label:Q',
                    scale=alt.Scale(scheme='greenblue'),
                    legend=alt.Legend(title="Boston house prices")),
    size=alt.value(50),
    tooltip=['x1','x2', 'predicted node label:Q']
).properties(title = "Activity landscape", width = 200, height = 200)

Visualize predicted vs real labels

This visualization uses the python package altair:

from ugtm import eGTM, eGTR
import numpy as np
import altair as alt
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn import model_selection

boston = datasets.load_boston()
X = boston.data
y = boston.target

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=42)

# optional preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Predict continuous outcomes for X_test
gtr = eGTR()
gtr = gtr.fit(X_train,y_train)
y_pred = gtr.predict(X_test)

# Get GTM transform for X_test
transformed = eGTM().fit(X_train).transform(X_test)

df = pd.DataFrame(transformed, columns=["x1", "x2"])
df["predicted label"] = y_pred
df["true label"] = y_test

chart1 = alt.Chart(df).mark_point().encode(
    x='x1', y='x2',
    color=alt.Color("predicted label:Q", scale=alt.Scale(scheme='greenblue'),
                    legend=alt.Legend(title="Boston house prices")),
    tooltip=["x1", "x2", "predicted label:Q", "true label:Q"]
).properties(title="Predicted labels", width=200, height=200).interactive()

chart2 = alt.Chart(df).mark_point().encode(
    x='x1', y='x2',
    color=alt.Color("true label:Q", scale=alt.Scale(scheme='greenblue'),
                    legend=alt.Legend(title="Boston house prices")),
    tooltip=["x1", "x2", "predicted label:Q", "true label:Q"]
).properties(title="True labels", width=200, height=200).interactive()

alt.hconcat(chart1, chart2)

Parameter optimization

GridSearchCV from sklearn can be used with eGTR for parameter optimization:

from ugtm import eGTR
import numpy as np
from sklearn.model_selection import GridSearchCV

# Dummy train and test
X_train = np.random.randn(100, 50)
X_test = np.random.randn(50, 50)
y_train = np.random.choice([1, 2, 3], size=100)

# Parameters to tune
tuned_params = {'regul': [0.0001, 0.001, 0.01],
                's': [0.1, 0.2, 0.3],
                'k': [16],
                'm': [4]}

# GTM regressor (GTR)
gs = GridSearchCV(eGTR(), tuned_params, cv=3, iid=False, scoring='r2')
gs.fit(X_train, y_train)
print(gs.best_params_)

API Reference

API reference of ugtm. This documentation was generated automatically from docstrings.

Modules

ugtm — a python package for Generative Topographic Mapping (GTM)

Visualization examples

GTM visualization examples on the following datasets: S-curve, severed sphere, and hand-written digits.

S-curve

from ugtm import eGTM,eGTR
import numpy as np
import altair as alt
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn import model_selection
from sklearn import manifold

X,y = datasets.make_s_curve(n_samples=1000, random_state=0)
man = manifold.TSNE(n_components=2, init='pca', random_state=0)
tsne = man.fit_transform(X)
man = manifold.MDS(max_iter=100, n_init=1, random_state=0)
mds = man.fit_transform(X)
man = manifold.LocallyLinearEmbedding(n_neighbors=20, n_components=2,
                                      eigen_solver='auto',
                                      method="standard",
                                      random_state=0)
lle = man.fit_transform(X)

# Construct GTM
gtm = eGTM(m=2).fit(X)
gtm_means = gtm.transform(X,model="means")
gtm_modes = gtm.transform(X,model="modes")

dgtm_modes = pd.DataFrame(gtm_modes, columns=["x1", "x2"])
dgtm_modes["label"] = y

gtm_modes = alt.Chart(dgtm_modes).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:Q']
).properties(title = "GTM (modes)", width = 100, height = 100)

dgtm_means = pd.DataFrame(gtm_means, columns=["x1", "x2"])
dgtm_means["label"] = y

gtm_means = alt.Chart(dgtm_means).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:Q']
).properties(title = "GTM (means)", width = 100, height = 100)

#Construct activity landscape
gtr = eGTR(m=2)
gtr = gtr.fit(X,y)

dfclassmap = pd.DataFrame(gtr.optimizedModel.matX, columns=["x1", "x2"])
dfclassmap["label"] = gtr.node_label

# Activity landscape
gtr = alt.Chart(dfclassmap).mark_square().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2', 'label:Q'],
    #opacity='density'
).properties(title = "GTM landscape",width = 100, height = 100)

dtsne = pd.DataFrame(tsne, columns=["x1", "x2"])
dmds = pd.DataFrame(mds, columns=["x1", "x2"])
dlle = pd.DataFrame(lle, columns=["x1", "x2"])
dtsne["label"] = y
dmds["label"] = y
dlle["label"] = y

tsne = alt.Chart(dtsne).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:Q']
).properties(title = "t-SNE", width = 100, height = 100)

mds = alt.Chart(dmds).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:Q']
).properties(title = "MDS", width = 100, height = 100)

lle = alt.Chart(dlle).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:Q']
).properties(title = "LLE", width = 100, height = 100)


gtm = gtm_means | gtm_modes | gtr
others = tsne | mds | lle

alt.vconcat(gtm, others)

Severed sphere

from ugtm import eGTM,eGTR
import numpy as np
import altair as alt
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn import model_selection
from sklearn import manifold
from sklearn.utils import check_random_state

random_state = check_random_state(0)
p = random_state.rand(1000) * (2 * np.pi - 0.55)
t = random_state.rand(1000) * np.pi

# Sever the poles from the sphere.
indices = ((t < (np.pi - (np.pi / 8))) & (t > ((np.pi / 8))))
x, y, z = np.sin(t[indices]) * np.cos(p[indices]), \
    np.sin(t[indices]) * np.sin(p[indices]), \
    np.cos(t[indices])

X = np.array([x, y, z]).T

y = p[indices]

man = manifold.TSNE(n_components=2, init='pca', random_state=0)
tsne = man.fit_transform(X)
man = manifold.MDS(max_iter=100, n_init=1, random_state=0)
mds = man.fit_transform(X)
man = manifold.LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                                      eigen_solver='auto',
                                      method="standard",
                                      random_state=0)
lle = man.fit_transform(X)

# Construct GTM
gtm = eGTM(m=2).fit(X)
gtm_means = gtm.transform(X,model="means")
gtm_modes = gtm.transform(X,model="modes")

dgtm_modes = pd.DataFrame(gtm_modes, columns=["x1", "x2"])
dgtm_modes["label"] = y

gtm_modes = alt.Chart(dgtm_modes).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:Q']
).properties(title = "GTM (modes)", width = 100, height = 100)

dgtm_means = pd.DataFrame(gtm_means, columns=["x1", "x2"])
dgtm_means["label"] = y

gtm_means = alt.Chart(dgtm_means).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:Q']
).properties(title = "GTM (means)", width = 100, height = 100)

#Construct activity landscape
gtr = eGTR(m=2)
gtr = gtr.fit(X,y)

dfclassmap = pd.DataFrame(gtr.optimizedModel.matX, columns=["x1", "x2"])
dfclassmap["label"] = gtr.node_label

# Activity landscape
gtr = alt.Chart(dfclassmap).mark_square().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2', 'label:Q'],
    #opacity='density'
).properties(title = "GTM landscape",width = 100, height = 100)

dtsne = pd.DataFrame(tsne, columns=["x1", "x2"])
dmds = pd.DataFrame(mds, columns=["x1", "x2"])
dlle = pd.DataFrame(lle, columns=["x1", "x2"])
dtsne["label"] = y
dmds["label"] = y
dlle["label"] = y

tsne = alt.Chart(dtsne).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:Q']
).properties(title = "t-SNE", width = 100, height = 100)

mds = alt.Chart(dmds).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:Q']
).properties(title = "MDS", width = 100, height = 100)

lle = alt.Chart(dlle).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:Q',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:Q']
).properties(title = "LLE", width = 100, height = 100)


gtm = gtm_means | gtm_modes | gtr
others = tsne | mds | lle

alt.vconcat(gtm, others)

Hand-written digits

from ugtm import eGTM,eGTC
import numpy as np
import altair as alt
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn import model_selection
from sklearn import manifold
from sklearn.utils import check_random_state


digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target

man = manifold.TSNE(n_components=2, init='pca', random_state=0)
tsne = man.fit_transform(X)
man = manifold.MDS(max_iter=100, n_init=1, random_state=0)
mds = man.fit_transform(X)
man = manifold.LocallyLinearEmbedding(n_neighbors=20, n_components=2,
                                      eigen_solver='auto',
                                      method="standard",
                                      random_state=0)
lle = man.fit_transform(X)

# Construct GTM
gtm = eGTM().fit(X)
gtm_means = gtm.transform(X,model="means")
gtm_modes = gtm.transform(X,model="modes")

dgtm_modes = pd.DataFrame(gtm_modes, columns=["x1", "x2"])
dgtm_modes["label"] = y

gtm_modes = alt.Chart(dgtm_modes).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:N',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:N']
).properties(title = "GTM (modes)", width = 100, height = 100)

dgtm_means = pd.DataFrame(gtm_means, columns=["x1", "x2"])
dgtm_means["label"] = y

gtm_means = alt.Chart(dgtm_means).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:N',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:N']
).properties(title = "GTM (means)", width = 100, height = 100)

#Construct class map
gtc = eGTC()
gtc = gtc.fit(X,y)

dfclassmap = pd.DataFrame(gtc.optimizedModel.matX, columns=["x1", "x2"])
dfclassmap["label"] = gtc.node_label

# Classification map
gtc = alt.Chart(dfclassmap).mark_square().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:N',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2', 'label:N'],
    #opacity='density'
).properties(title = "GTM class map",width = 100, height = 100)

dtsne = pd.DataFrame(tsne, columns=["x1", "x2"])
dmds = pd.DataFrame(mds, columns=["x1", "x2"])
dlle = pd.DataFrame(lle, columns=["x1", "x2"])
dtsne["label"] = digits.target_names[y]
dmds["label"] = digits.target_names[y]
dlle["label"] = digits.target_names[y]

tsne = alt.Chart(dtsne).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:N',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:N']
).properties(title = "t-SNE", width = 100, height = 100)

mds = alt.Chart(dmds).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:N',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:N']
).properties(title = "MDS", width = 100, height = 100)

lle = alt.Chart(dlle).mark_circle().encode(
    x='x1',
    y='x2',
    color=alt.Color('label:N',
                    scale=alt.Scale(scheme='viridis')),
    size=alt.value(50),
    tooltip=['x1','x2','label:N']
).properties(title = "LLE", width = 100, height = 100)


gtm = gtm_means | gtm_modes | gtc
others = tsne | mds | lle

alt.vconcat(gtm, others)

Classification examples

Breast cancer

We use the Breast Cancer Wisconsin dataset loaded from sklearn, originally downloaded from https://goo.gl/U2Uwz2.

The variables are the following:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

The target variable is the diagnosis (malignant/benign).

Example of parameter selection and cross-validation using GTM classification (GTC) and SVM classification (SVC):

from ugtm import eGTC
from sklearn.datasets import load_breast_cancer
import numpy as np
from sklearn import model_selection
from sklearn.metrics import balanced_accuracy_score
from sklearn.svm import SVC
from sklearn.metrics import classification_report


data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, random_state=42, shuffle=True)

performances = {}


# GTM classifier (GTC), bayesian

tuned_params = {'regul': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
                's': [0.1, 0.2, 0.3],
                'k': [16],
                'm': [4]}

gs = model_selection.GridSearchCV(eGTC(), tuned_params, cv=3, iid=False, scoring='balanced_accuracy')

gs.fit(X_train, y_train)

# Returns best score and best parameters
print(gs.best_score_)
print(gs.best_params_)

# Test data using model built with best parameters
y_true, y_pred = y_test, gs.predict(X_test)
print(classification_report(y_true, y_pred))

# Record performance on test set
performances['gtc'] = balanced_accuracy_score(y_true, y_pred)


# SVM classifier (SVC)

tuned_params = {'C':[1,10,100,1000],
                'gamma':[1,0.1,0.001,0.0001],
                'kernel':['rbf']}

gs = model_selection.GridSearchCV(SVC(random_state=42), tuned_params, cv=3, iid=False, scoring='balanced_accuracy')

gs.fit(X_train, y_train)

# Returns best score and best parameters
print(gs.best_score_)
print(gs.best_params_)

# Test data using model built with best parameters
y_true, y_pred = y_test, gs.predict(X_test)
print(classification_report(y_true, y_pred))

# Record performance on test set
performances['svm'] = balanced_accuracy_score(y_test, y_pred)

# Algorithm with best performance
max(performances.items(), key = lambda x: x[1])

Regression examples

Wine quality

We use the wine quality dataset from http://archive.ics.uci.edu/.

Example of parameter selection and cross-validation using GTM regression (GTR) and SVM regression (SVR):

from ugtm import eGTR
import numpy as np
from numpy import sqrt
from sklearn import model_selection
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyRegressor
import pandas as pd

# Load red wine data
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url,sep=";")
y = data['quality']
X = data.drop(labels='quality',axis=1)


X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.10, shuffle=True, random_state=42)


std = StandardScaler().fit(X_train)
X_train = std.transform(X_train)
X_test = std.transform(X_test)

performances = {}

# GTM regressor (GTR)

tuned_params = {'regul': [0.0001, 0.001, 0.01, 0.1, 1],
        's': [0.1, 0.2, 0.3],
        'k': [25],
        'm': [5]}

gs = model_selection.GridSearchCV(eGTR(), tuned_params, cv=3, iid=False, scoring='neg_mean_squared_error')

gs.fit(X_train, y_train)

# Returns best score and best parameters
print(gs.best_score_)
print(gs.best_params_)

# Test data using model built with best parameters
y_true, y_pred = y_test, gs.predict(X_test)

# Record performance on test set (RMSE)
performances['gtr'] = sqrt(mean_squared_error(y_true, y_pred))

# SVM regressor (SVR)

tuned_params = {'C':[1,10,100,1000],
        'gamma':[1,0.1,0.001,0.0001],
        'kernel':['rbf']}

gs = model_selection.GridSearchCV(SVR(), tuned_params, cv=3, iid=False, scoring='neg_mean_squared_error')

gs.fit(X_train, y_train)

# Returns best score and best parameters
print(gs.best_score_)
print(gs.best_params_)

# Test data using model built with best parameters
y_true, y_pred = y_test, gs.predict(X_test)

# Record performance on test set
performances['svm'] = sqrt(mean_squared_error(y_test, y_pred))

# Create a dummy regressor
dummy = DummyRegressor(strategy='mean')

# Train dummy regressor
dummy.fit(X_train, y_train)
y_true, y_pred = y_test, dummy.predict(X_test)

# Dummy performance
performances['dummy'] = sqrt(mean_squared_error(y_test, y_pred))
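As in the breast cancer example, the best algorithm can then be retrieved from the performances dictionary; since the recorded metric is RMSE, lower is better and min is used instead of max:

# Algorithm with best performance (lowest RMSE)
min(performances.items(), key=lambda x: x[1])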

Tutorial

ugtm provides an implementation of GTM (Generative Topographic Mapping), kGTM (kernel Generative Topographic Mapping), GTM classification models (kNN, Bayes) and GTM regression models. ugtm also implements cross-validation options which can be used to compare GTM classification models to SVM classification models, and GTM regression models to SVM regression models. Typical usage:

#!/usr/bin/env python

import ugtm
import numpy as np

#generate sample data and labels: replace this with your own data
data=np.random.randn(100,50)
labels=np.random.choice([1,2],size=100)

#build GTM map
gtm=ugtm.runGTM(data=data,verbose=True)

#plot GTM map (html)
gtm.plot_html(output="out")

For installation instructions, cf. https://github.com/hagax8/ugtm

1. Import package

Import the ugtm package, which allows you to construct GTM and kernel GTM (kGTM) maps, GTM classification models, and GTM regression models:

import ugtm

2. Construct and plot GTM maps (or kGTM maps)

A gtm object can be created by running the function runGTM on a dataset. The parameters for runGTM are: k = sqrt(number of nodes), m = sqrt(number of RBF centres), s = RBF width factor, regul = regularization coefficient. The number of iterations for the expectation-maximization algorithm is set to 200 by default. This is an example with random data:

import ugtm

#import numpy to generate random data
import numpy as np

#generate random data (independent variables x),
#discrete labels (dependent variable y),
#and continuous labels (dependent variable y),
#to experiment with categorical or continuous outcomes

train = np.random.randn(20,10)
test = np.random.randn(20,10)
labels=np.random.choice(["class1","class2"],size=20)
activity=np.random.randn(20,1)

#create a gtm object and write model
gtm = ugtm.runGTM(train)
gtm.write("testout1")

#run verbose
gtm = ugtm.runGTM(train, verbose=True)

#to run a kernel GTM model instead, run the following:
gtm = ugtm.runkGTM(train, doKernel=True, kernel="linear")

#access coordinates (means or modes), and responsibilities of gtm object
gtm_coordinates = gtm.matMeans
gtm_modes = gtm.matModes
gtm_responsibilities = gtm.matR
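The k, m, s and regul hyperparameters described above can also be set explicitly when building the map (a minimal sketch with arbitrary values):

#build GTM map with 16*16 nodes, 4*4 RBF centres,
#RBF width factor 0.3 and regularization coefficient 0.1
gtm = ugtm.runGTM(train, k=16, m=4, s=0.3, regul=0.1)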

3. Plot html maps

Call the plot_html() function on the gtm object:

#run model on train
gtm = ugtm.runGTM(train)

# ex. plot gtm object with landscape, html: labels are continuous
gtm.plot_html(output="testout10",labels=activity,discrete=False,pointsize=20)

# ex. plot gtm object with landscape, html: labels are discrete
gtm.plot_html(output="testout11",labels=labels,discrete=True,pointsize=20)

# ex. plot gtm object with landscape, html: labels are continuous
# no interpolation between nodes
gtm.plot_html(output="testout12",labels=activity,discrete=False,pointsize=20, \
              do_interpolate=False,ids=labels)

# ex. plot gtm object with landscape, html: labels are discrete,
# no interpolation between nodes
gtm.plot_html(output="testout13",labels=labels,discrete=True,pointsize=20, \
              do_interpolate=False)

4. Plot pdf maps

Call the plot() function on the gtm object:

#run model on train
gtm = ugtm.runGTM(train)

# ex. plot gtm object, pdf: no labels
gtm.plot(output="testout6",pointsize=20)

# ex. plot gtm object with landscape, pdf: labels are discrete
gtm.plot(output="testout7",labels=labels,discrete=True,pointsize=20)

# ex. plot gtm object with landscape, pdf: labels are continuous
gtm.plot(output="testout8",labels=activity,discrete=False,pointsize=20)

5. Plot multipanel views (only if labels or activities are provided)

Call the plot_multipanel() function on the gtm object. This plots a general model view showing means, modes, and the landscape, with or without points. The plot_multipanel() function only works if labels are defined:

#run model on train
gtm = ugtm.runGTM(train)

# ex. with discrete labels and inter-node interpolation
gtm.plot_multipanel(output="testout2",labels=labels,discrete=True,pointsize=20)

# ex. with continuous labels and inter-node interpolation
gtm.plot_multipanel(output="testout3",labels=activity,discrete=False,pointsize=20)

# ex. with discrete labels and no inter-node interpolation
gtm.plot_multipanel(output="testout4",labels=labels,discrete=True,pointsize=20, \
                    do_interpolate=False)

# ex. with continuous labels and no inter-node interpolation
gtm.plot_multipanel(output="testout5",labels=activity,discrete=False,pointsize=20, \
                    do_interpolate=False)

6. Project new data onto existing GTM map

New data can be projected onto an existing GTM map using the transform() function, which takes as input the gtm model, a training set and a test set. The training set is only used to preprocess the test set consistently with the training data (for example, to apply the same PCA transformation to the train and test sets before running the algorithm):

#run model on train
gtm = ugtm.runGTM(train,doPCA=True)

#test new data (test)
transformed=ugtm.transform(optimizedModel=gtm,train=train,test=test,doPCA=True)

#plot transformed test (html)
transformed.plot_html(output="testout14",pointsize=20)

#plot transformed test (pdf)
transformed.plot(output="testout15",pointsize=20)

#plot transformed data on existing classification model,
#using training set labels
gtm.plot_html_projection(output="testout16",projections=transformed,\
                         labels=labels, \
                         discrete=True,pointsize=20)

7. Output predictions for a test set: GTM regression (GTR) and classification (GTC)

The GTR() function implements the GTM regression model and the GTC() function implements the GTM classification model (cf. references):

#continuous labels (prediction by GTM regression model)
predicted=ugtm.GTR(train=train,test=test,labels=activity)

#discrete labels (prediction by GTM classification model)
predicted=ugtm.GTC(train=train,test=test,labels=labels)

8. Advanced GTM predictions with per-class probabilities

Per-class probabilities for a test set can be computed with the advancedGTC() function (you can set the m, k, regul and s parameters just as with runGTM):

#get whole output model and label predictions for test set
predicted_model=ugtm.advancedGTC(train=train,test=test,labels=labels)

#write whole predicted model with per-class probabilities
ugtm.printClassPredictions(predicted_model,"testout17")

9. Crossvalidation experiments

Several cross-validation experiments are implemented to compare GTC and GTR models to classical machine learning methods:

#crossvalidation experiment: GTM classification model implemented in ugtm,
#here: set hyperparameters s=1 and regul=1 (set to -1 to optimize)
ugtm.crossvalidateGTC(data=train,labels=labels,s=1,regul=1,n_repetitions=10,n_folds=5)

#crossvalidation experiment: GTM regression model
ugtm.crossvalidateGTR(data=train,labels=activity,s=1,regul=1)

#you can also run the following functions to compare
#with other classification/regression algorithms:

#crossvalidation experiment, k-nearest neighbours classification
#on 2D PCA map with 7 neighbors (set to -1 to optimize number of neighbours)
ugtm.crossvalidatePCAC(data=train,labels=labels,n_neighbors=7)

#crossvalidation experiment, SVC rbf classification model (sklearn implementation):
ugtm.crossvalidateSVCrbf(data=train,labels=labels,C=1,gamma=1)

#crossvalidation experiment, linear SVC classification model (sklearn implementation):
ugtm.crossvalidateSVC(data=train,labels=labels,C=1)

#crossvalidation experiment, linear SVC regression model (sklearn implementation):
ugtm.crossvalidateSVR(data=train,labels=activity,C=1,epsilon=1)

#crossvalidation experiment, k-nearest neighbours regression on 2D PCA map with 7 neighbors:
ugtm.crossvalidatePCAR(data=train,labels=activity,n_neighbors=7)

Glossary