pyXpcm: Ocean Profile Classification Model

pyXpcm is a Python package for creating and working with ocean Profile Classification Models; it consumes and produces Xarray objects. Xarray objects are N-D labeled arrays and datasets in Python.

An ocean Profile Classification Model automatically assembles ocean profiles into clusters according to the similarity of their vertical structure. The geospatial properties of these clusters can be used to address a large variety of oceanographic problems: front detection, water mass identification, natural region contouring (gyres, eddies), reference profile selection for QC validation, etc. The vertical structure of these clusters furthermore provides a highly synthetic representation of large ocean areas that can be used for dimensionality reduction and coherent intercomparisons of ocean data (re)analyses or simulations.

Documentation

Getting Started

Overview

What is an ocean PCM?

An ocean PCM is a Profile Classification Model for ocean data: a statistical procedure to classify ocean vertical profiles into a finite set of “clusters”. Depending on the dataset, such clusters can show space/time coherence that can be used in many different ways to study the ocean.

Statistical method

It consists of conducting unsupervised classification (clustering) on vertical profiles of one or more ocean variables.

Each level of the vertical axis of each ocean variable is considered a feature. One ocean vertical profile, with all of its ocean variables, is considered a sample.
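For instance, with two variables on a 50-level axis, the feature matrix handed to the classifier has one row per profile and one column per (variable, level) pair. A minimal numpy sketch with made-up data:

import numpy as np

# Toy data (random, for illustration only): 100 profiles, 50 vertical levels
n_prof, n_levels = 100, 50
temperature = np.random.randn(n_prof, n_levels)
salinity = np.random.randn(n_prof, n_levels)

# Each vertical level of each variable is a feature, each profile a sample:
X = np.hstack([temperature, salinity])  # shape (n_samples, n_features) = (100, 100)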

All the details of the Profile Classification Modelling (PCM) statistical methodology can be found in Maze et al, 2017.

Illustration

Given a collection of Argo temperature profiles in the North Atlantic, a PCM analysis is applied and produces an optimal set of 8 ocean temperature profile classes. The PCM clusters synthesize the structural information of heat distribution in the North Atlantic. Each cluster objectively defines an ocean region where the dynamics give rise to a unique vertical stratification pattern.

_images/graphical-abstract.png

Maze et al (2017) applied it to the North Atlantic with Argo temperature data. Jones et al (2019) later applied it to the Southern Ocean, also with Argo temperature data. Rosso et al (in prep) have applied it to the Southern Indian Ocean using both temperature and salinity Argo data.

pyXpcm

pyXpcm is a Python implementation of the PCM method that consumes and produces Xarray objects (xarray.Dataset and xarray.DataArray), hence the x.

With pyXpcm you can conduct a PCM analysis for a collection of profiles (gridded or not), of one or more ocean variables, stored in an xarray.Dataset. pyXpcm also provides basic statistics and plotting functions to get you started with your analysis.

The philosophy of the pyXpcm toolbox is to create a PCM that can be used from and on different ocean datasets and variables. To achieve this, a PCM is created with information about the ocean variables to classify and the vertical axis of these variables. This PCM can then be fitted on, and subsequently classify, ocean profiles from any dataset, as long as it contains the PCM variables.
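For instance (a sketch using the API detailed in the User guide; the dataset and variable names here are hypothetical), the same PCM could be fitted on an Argo collection and then classify a model output:

import numpy as np
from pyxpcm.models import pcm

z = np.arange(0., -1000., -10.)
m = pcm(K=8, features={'temperature': z})

# Fit on one dataset, classify another (ds_argo, ds_model and 'THETA' are hypothetical):
m.fit(ds_argo, features={'temperature': 'TEMP'}, dim='DEPTH')
m.predict(ds_model, features={'temperature': 'THETA'}, dim='depth')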

The pyXpcm procedure is to preprocess data (stack, scale, reduce and combine) and then to fit a classifier on them. Once the model is fitted, pyXpcm can classify data. The library borrows much of its vocabulary and logic from Scikit-learn, but does not inherit from sklearn.base.BaseEstimator.
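For intuition only, here is a rough Scikit-learn analogue of that chain (a sketch on random data, not pyXpcm's actual implementation):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

X = np.random.randn(500, 100)                 # stacked profiles: (n_samples, n_features)
X = StandardScaler().fit_transform(X)         # scale each feature
X = PCA(n_components=15).fit_transform(X)     # reduce dimensionality
gmm = GaussianMixture(n_components=8).fit(X)  # fit the classifier
labels = gmm.predict(X)                       # classify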

Installation

Required dependencies

  • Python 3.6
  • Xarray 0.12
  • Dask 0.16
  • Scikit-learn 0.19

Note that Scikit-learn is the default statistics backend, but if Dask_ml is installed you can use it as well (see the API reference).
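For instance, assuming a backend keyword on the constructor as described in the API reference (the exact keyword name is an assumption here), switching backends could look like:

import numpy as np
from pyxpcm.models import pcm

z = np.arange(0., -1000., -10.)
# 'backend' keyword assumed from the API reference:
m = pcm(K=8, features={'temperature': z}, backend='sklearn')    # default backend
# m = pcm(K=8, features={'temperature': z}, backend='dask_ml')  # if Dask_ml is installed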

For full plotting functionality (see the Plotting API) the following packages are required:

  • Matplotlib 3.0 (mandatory)
  • Cartopy 0.17 (for some methods only)
  • Seaborn 0.9.0 (for some methods only)

Instructions

For the development version:

pip install git+https://github.com/gmaze/pyxpcm.git@nfeatures

For the latest public release:

pip install pyxpcm

Pre-trained PCM

We will try to list here pre-trained PCM models, ready for you to use to classify your data.

Features                     K [1]  Relevant domain  Training data     Access link  Reference
Temperature (0-1400m, 5m)    8      North Atlantic   Argo (2000-2014)  Archimer     Maze et al (2017)
Temperature (15-980db, 5db)  8      Southern Ocean   Argo (2001-2017)  [2]          Jones et al (2019)

[1] Number of classes
[2] Coming up soon!

User guide

Standard procedure

Here is a standard procedure based on pyXpcm. It shows how to create a model, how to fit/train it, how to classify data, and how to visualise the results.

Create a model

Let’s import the Profile Classification Model (PCM) constructor:

[2]:
from pyxpcm.models import pcm

A PCM can be created independently of any dataset using the class constructor.

To be created, a PCM requires a number of classes (or clusters) and a dictionary defining the list of features and their vertical axis:

[3]:
import numpy as np

z = np.arange(0., -1000., -10.)
pcm_features = {'temperature': z, 'salinity':z}

We can now instantiate a PCM, say with 8 classes:

[4]:
m = pcm(K=8, features=pcm_features)
m
[4]:
<pcm 'gmm' (K: 8, F: 2)>
Number of class: 8
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: False
Feature: 'temperature'
     Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
     Scaler: 'normal', <class 'sklearn.preprocessing.data.StandardScaler'>
     Reducer: True, <class 'sklearn.decomposition.pca.PCA'>
Feature: 'salinity'
     Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
     Scaler: 'normal', <class 'sklearn.preprocessing.data.StandardScaler'>
     Reducer: True, <class 'sklearn.decomposition.pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture.gaussian_mixture.GaussianMixture'>

Here we created a PCM with 8 classes (K=8) and 2 features (F=2) that are temperature and salinity profiles defined between the surface and 1000m depth.

We furthermore note the list of transform methods that will be used to preprocess each of the features (see the preprocessing documentation page for more details).

Note that the number of classes and features are PCM properties, accessible as pyxpcm.pcm.K and pyxpcm.pcm.F.
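For instance:

m.K  # -> 8, the number of classes
m.F  # -> 2, the number of features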

Load training data

pyXpcm is able to work with both gridded datasets (e.g. model outputs with longitude, latitude and time dimensions) and collections of profiles (e.g. Argo, XBT, CTD section profiles).

In this example, let's import a sample of North Atlantic Argo data that comes with pyXpcm:

[5]:
import pyxpcm
ds = pyxpcm.tutorial.open_dataset('argo').load()
print(ds)
<xarray.Dataset>
Dimensions:    (DEPTH: 282, N_PROF: 7560)
Coordinates:
  * DEPTH      (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Dimensions without coordinates: N_PROF
Data variables:
    LATITUDE   (N_PROF) float32 ...
    LONGITUDE  (N_PROF) float32 ...
    TIME       (N_PROF) datetime64[ns] ...
    DBINDEX    (N_PROF) float64 ...
    TEMP       (N_PROF, DEPTH) float32 ...
    PSAL       (N_PROF, DEPTH) float32 ...
    SIG0       (N_PROF, DEPTH) float32 ...
    BRV2       (N_PROF, DEPTH) float32 ...
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182

Fit the model on data

Fitting can be done on any dataset coherent with the PCM definition, in the sense that it must contain the feature variables of the PCM.

To tell the PCM model how to identify features in any xarray.Dataset, we need to provide a dictionary of variable names mapping:

[6]:
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}

which means that the PCM feature temperature is to be found in the dataset variable TEMP.

We also need to specify the vertical dimension of the dataset variables:

[7]:
features_zdim='DEPTH'

Now we're ready to fit the model on this dataset:

[8]:
m.fit(ds, features=features_in_ds, dim=features_zdim)
m
[8]:
<pcm 'gmm' (K: 8, F: 2)>
Number of class: 8
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: True
Feature: 'temperature'
     Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
     Scaler: 'normal', <class 'sklearn.preprocessing.data.StandardScaler'>
     Reducer: True, <class 'sklearn.decomposition.pca.PCA'>
Feature: 'salinity'
     Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
     Scaler: 'normal', <class 'sklearn.preprocessing.data.StandardScaler'>
     Reducer: True, <class 'sklearn.decomposition.pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture.gaussian_mixture.GaussianMixture'>
     log likelihood of the training set: 37.953397

Note

pyXpcm can also identify PCM features and axes within an xarray.Dataset from variable attributes. Continuing the example above, we can set:

ds['TEMP'].attrs['feature_name'] = 'temperature'
ds['PSAL'].attrs['feature_name'] = 'salinity'
ds['DEPTH'].attrs['axis'] = 'Z'

And then simply call the fit method without arguments:

m.fit(ds)

Note that if the data follow the CF conventions, the vertical dimension's axis attribute should already be set to Z.

Classify data

Now that the PCM is fitted, we can predict classification labels:

[9]:
m.predict(ds, features=features_in_ds, inplace=True)
ds
[9]:
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560)
Coordinates:
  * N_PROF      (N_PROF) int64 0 1 2 3 4 5 6 ... 7554 7555 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Data variables:
    LATITUDE    (N_PROF) float32 ...
    LONGITUDE   (N_PROF) float32 ...
    TIME        (N_PROF) datetime64[ns] ...
    DBINDEX     (N_PROF) float64 ...
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 ...
    BRV2        (N_PROF, DEPTH) float32 ...
    PCM_LABELS  (N_PROF) int64 4 4 4 4 4 4 4 4 4 4 4 4 ... 3 3 3 3 3 3 3 3 3 3 3
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182

Prediction labels are automatically added to the dataset as PCM_LABELS because the option inplace was set to True. We didn’t specify the dim option because our dataset is CF compliant.
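If you prefer not to modify the dataset, keep the default inplace=False and capture the labels as a new xarray.DataArray instead:

labels = m.predict(ds, features=features_in_ds)  # inplace=False by default, returns a DataArray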

pyXpcm uses a GMM (Gaussian Mixture Model) classifier by default, which is a fuzzy classifier, so we can also predict the probability of each class for all profiles, the so-called posteriors:

[10]:
m.predict_proba(ds, features=features_in_ds, inplace=True)
ds
[10]:
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560, pcm_class: 8)
Coordinates:
  * N_PROF      (N_PROF) int64 0 1 2 3 4 5 6 ... 7554 7555 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Dimensions without coordinates: pcm_class
Data variables:
    LATITUDE    (N_PROF) float32 ...
    LONGITUDE   (N_PROF) float32 ...
    TIME        (N_PROF) datetime64[ns] ...
    DBINDEX     (N_PROF) float64 ...
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 ...
    BRV2        (N_PROF, DEPTH) float32 ...
    PCM_LABELS  (N_PROF) int64 4 4 4 4 4 4 4 4 4 4 4 4 ... 3 3 3 3 3 3 3 3 3 3 3
    PCM_POST    (pcm_class, N_PROF) float64 3.822e-09 3.296e-05 ... 2.269e-70
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182

which are added to the dataset as the PCM_POST variable. The class probabilities come with a new dimension, pcm_class by default, which goes from 0 to K-1.
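The pcm_class dimension makes it easy to work with the posteriors of a single class, e.g.:

post_k0 = ds['PCM_POST'].isel(pcm_class=0)  # posterior probability of class 0 for each profile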

Note

You can delete variables added by pyXpcm to the xarray.Dataset with the pyxpcm.xarray.pyXpcmDataSetAccessor.drop_all() method:

ds.pyxpcm.drop_all()

Or you can split the pyXpcm variables out of the original xarray.Dataset:

ds_pcm, ds = ds.pyxpcm.split()

It is important to note that once the PCM is fitted, you can predict labels for any dataset, as long as it has the PCM features.

For instance, let’s predict labels for a gridded dataset:

[12]:
ds_gridded = pyxpcm.tutorial.open_dataset('isas_snapshot').load()
ds_gridded
[12]:
<xarray.Dataset>
Dimensions:      (depth: 152, latitude: 53, longitude: 61)
Coordinates:
  * latitude     (latitude) float32 30.023445 30.455408 ... 49.41288 49.737103
  * longitude    (longitude) float32 -70.0 -69.5 -69.0 ... -41.0 -40.5 -40.0
  * depth        (depth) float32 -1.0 -3.0 -5.0 ... -1960.0 -1980.0 -2000.0
Data variables:
    TEMP         (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    TEMP_ERR     (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    TEMP_PCTVAR  (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    PSAL         (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    PSAL_ERR     (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    PSAL_PCTVAR  (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    SST          (latitude, longitude) float32 dask.array<shape=(53, 61), chunksize=(53, 61)>
[13]:
m.predict(ds_gridded, features={'temperature':'TEMP','salinity':'PSAL'}, dim='depth', inplace=True)
ds_gridded
[13]:
<xarray.Dataset>
Dimensions:      (depth: 152, latitude: 53, longitude: 61)
Coordinates:
  * latitude     (latitude) float64 30.02 30.46 30.89 ... 49.09 49.41 49.74
  * longitude    (longitude) float64 -70.0 -69.5 -69.0 ... -41.0 -40.5 -40.0
  * depth        (depth) float32 -1.0 -3.0 -5.0 ... -1960.0 -1980.0 -2000.0
Data variables:
    TEMP         (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    TEMP_ERR     (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    TEMP_PCTVAR  (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    PSAL         (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    PSAL_ERR     (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    PSAL_PCTVAR  (depth, latitude, longitude) float32 dask.array<shape=(152, 53, 61), chunksize=(152, 53, 61)>
    SST          (latitude, longitude) float32 dask.array<shape=(53, 61), chunksize=(53, 61)>
    PCM_LABELS   (latitude, longitude) float64 4.0 4.0 4.0 4.0 ... 1.0 1.0 1.0

where you can see the addition of the PCM_LABELS variable.

Vertical structure of classes

One key outcome of a PCM analysis is the vertical structure of each class. This can be computed using the pyxpcm.stat.quantile method.

Below we compute the 5%, 50% and 95% quantiles of temperature and salinity for each class:

[14]:
for vname in ['TEMP', 'PSAL']:
    ds = ds.pyxpcm.quantile(m, q=[0.05, 0.5, 0.95], of=vname, outname=vname + '_Q', keep_attrs=True, inplace=True)
ds
[14]:
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560, pcm_class: 8, quantile: 3)
Coordinates:
  * pcm_class   (pcm_class) int64 0 1 2 3 4 5 6 7
  * N_PROF      (N_PROF) int64 0 1 2 3 4 5 6 ... 7554 7555 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
  * quantile    (quantile) float64 0.05 0.5 0.95
Data variables:
    LATITUDE    (N_PROF) float32 27.122 27.818 27.452 26.976 ... 4.243 4.15 4.44
    LONGITUDE   (N_PROF) float32 -74.86 -75.6 -74.949 ... -1.263 -0.821 -0.002
    TIME        (N_PROF) datetime64[ns] 2008-06-23T13:07:30 ... 2013-03-09T14:52:58.124999936
    DBINDEX     (N_PROF) float64 1.484e+04 1.622e+04 ... 8.557e+03 1.063e+04
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 23.601229 23.601229 ... 27.685583
    BRV2        (N_PROF, DEPTH) float32 0.00029447526 ... 4.500769e-06
    PCM_LABELS  (N_PROF) int64 4 4 4 4 4 4 4 4 4 4 4 4 ... 3 3 3 3 3 3 3 3 3 3 3
    PCM_POST    (pcm_class, N_PROF) float64 3.822e-09 3.296e-05 ... 2.269e-70
    TEMP_Q      (pcm_class, quantile, DEPTH) float64 12.73 12.73 ... 5.287 5.273
    PSAL_Q      (pcm_class, quantile, DEPTH) float64 35.12 35.12 ... 35.09 35.09
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182

Quantiles can be plotted using the pyxpcm.plot.quantile method.

[15]:
fig, ax = m.plot.quantile(ds['TEMP_Q'], maxcols=4, figsize=(10, 8), sharey=True)
_images/example_38_0.png

Geographic distribution of classes

Warning

To follow this section you’ll need to have Cartopy installed and working.

A map of labels can now easily be plotted:

[16]:
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

proj = ccrs.PlateCarree()
subplot_kw={'projection': proj, 'extent': np.array([-80,1,-1,66]) + np.array([-0.1,+0.1,-0.1,+0.1])}
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5,5), dpi=120, facecolor='w', edgecolor='k', subplot_kw=subplot_kw)

kmap = m.plot.cmap()
sc = ax.scatter(ds['LONGITUDE'], ds['LATITUDE'], s=3, c=ds['PCM_LABELS'], cmap=kmap, transform=proj, vmin=0, vmax=m.K)
cl = m.plot.colorbar(ax=ax)

gl = m.plot.latlongrid(ax, dx=10)
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.COASTLINE)
ax.set_title('LABELS of the training set')
plt.show()
_images/example_42_0.png

Since we predicted labels for two datasets, we can superimpose them:

[17]:
proj = ccrs.PlateCarree()
subplot_kw={'projection': proj, 'extent': np.array([-75,-35,25,55]) + np.array([-0.1,+0.1,-0.1,+0.1])}
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5,5), dpi=120, facecolor='w', edgecolor='k', subplot_kw=subplot_kw)

kmap = m.plot.cmap()
sc = ax.pcolor(ds_gridded['longitude'], ds_gridded['latitude'], ds_gridded['PCM_LABELS'], cmap=kmap, transform=proj, vmin=0, vmax=m.K)
sc = ax.scatter(ds['LONGITUDE'], ds['LATITUDE'], s=10, c=ds['PCM_LABELS'], cmap=kmap, transform=proj, vmin=0, vmax=m.K, edgecolors=[0.3]*3, linewidths=0.3)
cl = m.plot.colorbar(ax=ax)

gl = m.plot.latlongrid(ax, dx=10)
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.COASTLINE)
ax.set_title('LABELS of the training set (dots) and another product (shade)')
plt.show()
_images/example_44_0.png

Posteriors are defined for each data point and give the probability that the point belongs to each of the classes. They can be plotted this way:

[18]:
import seaborn as sns

cmap = sns.light_palette("blue", as_cmap=True)
proj = ccrs.PlateCarree()
subplot_kw={'projection': proj, 'extent': np.array([-80,1,-1,66]) + np.array([-0.1,+0.1,-0.1,+0.1])}
fig, ax = m.plot.subplots(figsize=(10,22), maxcols=2, subplot_kw=subplot_kw)

for k in m:
    sc = ax[k].scatter(ds['LONGITUDE'], ds['LATITUDE'], s=3, c=ds['PCM_POST'].sel(pcm_class=k),
                       cmap=cmap, transform=proj, vmin=0, vmax=1)
    cl = plt.colorbar(sc, ax=ax[k], fraction=0.03)
    gl = m.plot.latlongrid(ax[k], fontsize=8, dx=20, dy=10)
    ax[k].add_feature(cfeature.LAND)
    ax[k].add_feature(cfeature.COASTLINE)
    ax[k].set_title('PCM Posteriors k=%i' % k)
_images/example_46_0.png

PCM properties

The PCM class does a lot of data preprocessing under the hood in order to classify profiles.

Here is how to access PCM preprocessed results and data.

Import and set-up

Import the library and toy data

[2]:
import pyxpcm
from pyxpcm.models import pcm

Let's work with a standard PCM of temperature and salinity, from the surface down to 1000m depth:

[3]:
import numpy as np

# Define a vertical axis to work with
z = np.arange(0., -1000., -10.)

# Define features to use
features_pcm = {'temperature': z, 'salinity': z}

# Instantiate the PCM
m = pcm(K=4, features=features_pcm, maxvar=2)
print(m)
<pcm 'gmm' (K: 4, F: 2)>
Number of class: 4
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: False
Feature: 'temperature'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing.data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition.pca.PCA'>
Feature: 'salinity'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing.data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition.pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture.gaussian_mixture.GaussianMixture'>

Note that here we used a strong dimensionality reduction to limit the dimensions and size of the plots to come (maxvar=2 tells the PCM to use the first 2 PCA components of each variable).

Now we can load a dataset to be used for fitting.

[4]:
ds = pyxpcm.tutorial.open_dataset('argo').load()

Fit the model and predict labels on the data:

[5]:
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}
ds = ds.pyxpcm.fit_predict(m, features=features_in_ds, inplace=True)
print(ds)
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560)
Coordinates:
  * N_PROF      (N_PROF) int64 0 1 2 3 4 5 6 ... 7554 7555 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Data variables:
    LATITUDE    (N_PROF) float32 ...
    LONGITUDE   (N_PROF) float32 ...
    TIME        (N_PROF) datetime64[ns] ...
    DBINDEX     (N_PROF) float64 ...
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 ...
    BRV2        (N_PROF, DEPTH) float32 ...
    PCM_LABELS  (N_PROF) int64 1 1 1 1 1 1 1 1 1 1 1 1 ... 3 3 3 3 3 3 3 3 3 3 3
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182

Scaler properties

[6]:
fig, ax = m.plot.scaler()

# More options:
# m.plot.scaler(style='darkgrid')
# m.plot.scaler(style='darkgrid', subplot_kw={'ylim':[-1000,0]})
_images/pcm_prop_11_0.png

Reducer properties

Plot the eigenvectors of a PCA reducer (nothing is plotted if no reducer is used):

[7]:
fig, ax = m.plot.reducer()
# Equivalent to:
# pcmplot.reducer(m)

# More options:
# m.plot.reducer(pcalist = range(0,4));
# m.plot.reducer(pcalist = [0], maxcols=1);
# m.plot.reducer(pcalist = range(0,4), style='darkgrid',  plot_kw={'linewidth':1.5}, subplot_kw={'ylim':[-1400,0]}, figsize=(12,10));
_images/pcm_prop_13_0.png

Scatter plot of features, as seen by the classifier

You can access the pre-processed data for your own plots/analysis through the pyxpcm.pcm.preprocessing() method:

[8]:
X, sampling_dims = m.preprocessing(ds, features=features_in_ds)
X
[8]:
<xarray.DataArray (n_samples: 7560, n_features: 4)>
array([[ 1.928166, -0.091499,  1.7341  , -0.270248],
       [ 2.314077,  0.106842,  2.083683, -0.18765 ],
       [ 1.675512, -0.17313 ,  1.563701, -0.432449],
       ...,
       [-0.802601, -0.578377, -1.576134, -0.311841],
       [-0.955218, -0.609439, -1.804922, -0.427322],
       [-0.892514, -0.623732, -1.792266, -0.465512]], dtype=float32)
Coordinates:
  * n_samples   (n_samples) int64 0 1 2 3 4 5 ... 7554 7555 7556 7557 7558 7559
  * n_features  (n_features) <U13 'temperature_0' ... 'salinity_1'

pyXpcm returns a 2-dimensional xarray.DataArray whose pairwise relationships can easily be visualised with the pyxpcm.plot.preprocessed() method (this requires Seaborn):

[9]:
g = m.plot.preprocessed(ds, features=features_in_ds, style='darkgrid')

# A posteriori adjustements:
# g.set(xlim=(-3,3),ylim=(-3,3))
# g.savefig('toto.png')
_images/pcm_prop_17_0.png
[10]:
# Combine KDE with histograms (a very slow plot):
g = m.plot.preprocessed(ds, features=features_in_ds, kde=True)
/Users/gmaze/anaconda/envs/obidam36/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
_images/pcm_prop_18_1.png

Save and load PCM from local files

PCM instances are lightweight Python objects and can easily be saved to and loaded from files. pyXpcm uses the netCDF file format because it makes it easy to attach metadata to numerical arrays.

Import and set-up

Import the library and toy data

[2]:
import pyxpcm
from pyxpcm.models import pcm

# Load tutorial data:
ds = pyxpcm.tutorial.open_dataset('argo').load()

Saving a model

Let's first create a PCM and fit it on the tutorial dataset:

[3]:
import numpy as np

# Define a vertical axis to work with
z = np.arange(0., -1000., -10.)

# Define features to use
features_pcm = {'temperature': z, 'salinity': z}

# Instantiate the PCM:
m = pcm(K=4, features=features_pcm)

# Fit:
m.fit(ds, features={'temperature': 'TEMP', 'salinity': 'PSAL'})
[3]:
<pcm 'gmm' (K: 4, F: 2)>
Number of class: 4
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: True
Feature: 'temperature'
     Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
     Scaler: 'normal', <class 'sklearn.preprocessing.data.StandardScaler'>
     Reducer: True, <class 'sklearn.decomposition.pca.PCA'>
Feature: 'salinity'
     Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
     Scaler: 'normal', <class 'sklearn.preprocessing.data.StandardScaler'>
     Reducer: True, <class 'sklearn.decomposition.pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture.gaussian_mixture.GaussianMixture'>
     log likelihood of the training set: 33.234426

We can now save the fitted model to a local file:

[4]:
m.to_netcdf('my_pcm.nc')

Loading a model

To load a PCM from file, use:

[5]:
m_loaded = pyxpcm.load_netcdf('my_pcm.nc')
m_loaded
[5]:
<pcm 'gmm' (K: 4, F: 2)>
Number of class: 4
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: True
Feature: 'temperature'
     Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
     Scaler: 'normal', <class 'sklearn.preprocessing.data.StandardScaler'>
     Reducer: True, <class 'sklearn.decomposition.pca.PCA'>
Feature: 'salinity'
     Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
     Scaler: 'normal', <class 'sklearn.preprocessing.data.StandardScaler'>
     Reducer: True, <class 'sklearn.decomposition.pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture.gaussian_mixture.GaussianMixture'>
     log likelihood of the training set: 33.234426

Debugging and performance

Import and set-up

Import the library and toy data

[2]:
import numpy as np
import pyxpcm
from pyxpcm.models import pcm

# Load a dataset to work with:
ds = pyxpcm.tutorial.open_dataset('argo').load()

# Define vertical axis and features to use:
z = np.arange(0.,-1000.,-10.)
features_pcm = {'temperature': z, 'salinity': z}
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}

Debugging

Use the debug option to print log messages:

[3]:
# Instantiate a new PCM:
m = pcm(K=8, features=features_pcm, debug=True)

# Fit with log:
m.fit(ds, features=features_in_ds);
> Start preprocessing for action 'fit'

        > Preprocessing xarray dataset 'TEMP' as PCM feature 'temperature'
         [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (282,))] X RAVELED with success
                Output axis is in the input axis, not need to interpolate, simple intersection
         [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (100,))] X INTERPOLATED with success)
         [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None] X SCALED with success)
         [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None] X REDUCED with success)
        temperature pre-processed with success,  [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
        Homogenisation for fit of temperature

        > Preprocessing xarray dataset 'PSAL' as PCM feature 'salinity'
         [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (282,))] X RAVELED with success
                Output axis is in the input axis, not need to interpolate, simple intersection
         [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (100,))] X INTERPOLATED with success)
         [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None] X SCALED with success)
         [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None] X REDUCED with success)
        salinity pre-processed with success,  [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
        Homogenisation for fit of salinity
        Features array shape and type for xarray: (7560, 30) <class 'numpy.ndarray'> <class 'memoryview'>
> Preprocessing done, working with final X (<class 'xarray.core.dataarray.DataArray'>) array of shape: (7560, 30)  and sampling dimensions: ['N_PROF']

Performance / Optimisation

Use the timeit and timeit_verb options to measure the computation time of PCM operations.

Times are accessible as a pandas.DataFrame through the timeit property of the pyXpcm instance.

The m.plot.timeit() plot method allows for a simple visualisation of these times.

Time readings during execution
[4]:
# Create a PCM and execute methods:
m = pcm(K=8, features=features_pcm, timeit=True, timeit_verb=1)
m.fit(ds, features=features_in_ds);
  fit.1-preprocess.1-mask: 62 ms
  fit.1-preprocess.2-feature_temperature.1-ravel: 27 ms
  fit.1-preprocess.2-feature_temperature.2-interp: 2 ms
  fit.1-preprocess.2-feature_temperature.3-scale_fit: 15 ms
  fit.1-preprocess.2-feature_temperature.4-scale_transform: 6 ms
  fit.1-preprocess.2-feature_temperature.5-reduce_fit: 21 ms
  fit.1-preprocess.2-feature_temperature.6-reduce_transform: 7 ms
  fit.1-preprocess.2-feature_temperature.total: 80 ms
  fit.1-preprocess: 80 ms
  fit.1-preprocess.3-homogeniser: 5 ms
  fit.1-preprocess.2-feature_salinity.1-ravel: 32 ms
  fit.1-preprocess.2-feature_salinity.2-interp: 1 ms
  fit.1-preprocess.2-feature_salinity.3-scale_fit: 11 ms
  fit.1-preprocess.2-feature_salinity.4-scale_transform: 5 ms
  fit.1-preprocess.2-feature_salinity.5-reduce_fit: 18 ms
  fit.1-preprocess.2-feature_salinity.6-reduce_transform: 4 ms
  fit.1-preprocess.2-feature_salinity.total: 75 ms
  fit.1-preprocess: 75 ms
  fit.1-preprocess.3-homogeniser: 1 ms
  fit.1-preprocess.4-xarray: 1 ms
  fit.1-preprocess: 228 ms
  fit.fit: 3400 ms
  fit.score: 12 ms
  fit: 3642 ms
A posteriori execution time analysis
[5]:
# Create a PCM and execute methods:
m = pcm(K=8, features=features_pcm, timeit=True, timeit_verb=0)
m.fit(ds, features=features_in_ds);
m.predict(ds, features=features_in_ds);
m.fit_predict(ds, features=features_in_ds);

Execution times are accessible as a dataframe through the pyxpcm.pcm.timeit property:

[6]:
m.timeit
[6]:
Method       Sub-method    Sub-sub-method         Sub-sub-sub-method
fit          1-preprocess  1-mask                 total                   60.623884
                           2-feature_temperature  1-ravel                 29.070854
                                                  2-interp                 1.361847
                                                  3-scale_fit             24.303198
                                                  4-scale_transform        5.542994
                                                  5-reduce_fit            17.215014
                                                  6-reduce_transform       4.530907
                                                  total                   82.225800
                           total                                         405.465841
                           3-homogeniser          total                    3.330231
                           2-feature_salinity     1-ravel                 33.647060
                                                  2-interp                 1.427889
                                                  3-scale_fit             19.104004
                                                  4-scale_transform       16.283989
                                                  5-reduce_fit            13.432264
                                                  6-reduce_transform       3.180981
                                                  total                   87.301970
                           4-xarray               total                    1.182079
             fit           total                                        1668.042660
             score         total                                          14.346838
             total                                                      1918.222189
predict      1-preprocess  1-mask                 total                   64.723015
                           2-feature_temperature  1-ravel                 28.513908
                                                  2-interp                 1.239061
                                                  3-scale_fit              0.003099
                                                  4-scale_transform        7.060051
                                                  5-reduce_fit             0.002146
                                                  6-reduce_transform       2.730846
                                                  total                   39.700031
                           total                                         235.766172
                                                                           ...
                           2-feature_salinity     6-reduce_transform       2.788067
                                                  total                   44.227123
                           4-xarray               total                    1.113892
             predict       total                                          10.058880
             score         total                                          11.398077
             xarray        total                                          11.323929
             total                                                       184.562922
fit_predict  1-preprocess  1-mask                 total                   64.216852
                           2-feature_temperature  1-ravel                 26.321888
                                                  2-interp                 1.183033
                                                  3-scale_fit              0.001907
                                                  4-scale_transform        5.228996
                                                  5-reduce_fit             0.000954
                                                  6-reduce_transform       2.723217
                                                  total                   35.592079
                           total                                         224.620104
                           3-homogeniser          total                    2.858639
                           2-feature_salinity     1-ravel                 29.989958
                                                  2-interp                 1.201153
                                                  3-scale_fit              0.000954
                                                  4-scale_transform        5.232811
                                                  5-reduce_fit             0.001907
                                                  6-reduce_transform       4.884958
                                                  total                   41.451693
                           4-xarray               total                    1.657963
             fit           total                                        2717.261076
             score         total                                          11.775970
             predict       total                                          10.827065
             xarray        total                                          10.989189
             total                                                      2898.393869
Length: 66, dtype: float64

Visualisation help

To facilitate your analysis of execution times, you can use pyxpcm.plot.timeit().

Main steps by method
[7]:
fig, ax, df = m.plot.timeit(group='Method', split='Sub-method', style='darkgrid') # Default group/split
df
[7]:
Sub-method   1-preprocess          fit    predict      score     xarray
Method
fit            809.230804  1668.042660        NaN  14.346838        NaN
fit_predict    447.169065  2717.261076  10.827065  11.775970  10.989189
predict        469.947577          NaN  10.058880  11.398077  11.323929
_images/debug_perf_16_1.png
Preprocessing main steps by method
[8]:
fig, ax, df = m.plot.timeit(group='Method', split='Sub-sub-method')
df
[8]:
Sub-sub-method     1-mask  2-feature_salinity  2-feature_temperature  3-homogeniser  4-xarray
Method
fit             60.623884          174.378157             164.250612       3.330231  1.182079
fit_predict     64.216852           82.763433              71.052074       2.858639  1.657963
predict         64.723015           88.269234              79.249144       0.826120  1.113892
_images/debug_perf_18_1.png
Preprocessing details by method
[9]:
fig, ax, df = m.plot.timeit(group='Method', split='Sub-sub-sub-method')
df
[9]:
Sub-sub-sub-method    1-ravel  2-interp  3-scale_fit  4-scale_transform  5-reduce_fit  6-reduce_transform
Method
fit                 62.717915  2.789736    43.407202          21.826982     30.647278            7.711887
fit_predict         56.311846  2.384186     0.002861          10.461807      0.002861            7.608175
predict             60.415030  4.472017     0.005245          13.175964      0.004053            5.518913
_images/debug_perf_20_1.png
Preprocessing details by features
[10]:
fig, ax, df = m.plot.timeit(split='Sub-sub-sub-method', group='Sub-sub-method', unit='s')
df
[10]:
Sub-sub-sub-method      1-ravel  2-interp  3-scale_fit  4-scale_transform  5-reduce_fit  6-reduce_transform
Sub-sub-method
2-feature_salinity     0.095538  0.005862     0.019107           0.027633      0.013436            0.010854
2-feature_temperature  0.083907  0.003784     0.024308           0.017832      0.017218            0.009985
_images/debug_perf_22_1.png

Help & reference

Bibliography

What’s New

v0.4 (1 Nov. 2019)

Warning

The API has changed, breaking backward compatibility.

  • Enhancements:

    • Multiple-features classification
    • ND-array classification (so that you can directly classify profiles from gridded products, e.g. a latitude/longitude/time grid, and not only collections of profiles already in a 2D array)
    • pyXpcm methods can be accessed through the xarray.Dataset accessor namespace pyxpcm
    • Allows choosing the statistics backend (sklearn, dask_ml or user-defined)
    • Save/load PCM to/from netcdf files
  • pyXpcm now consumes xarray/dask objects all along, not only on the user front-end. This adds a small overhead with small datasets but allows the PCM to handle larger and more complex datasets.

v0.3 (5 Apr. 2019)

  • Removed support for python 2.7
  • Added more data input consistency checks
  • Fixed bugs in interpolation and plotting methods
  • Added custom colormap and colorbar to plot module

v0.2 (26 Mar. 2019)

  • Upgrade to python 3.6 (compatible 2.7)
  • Added test for continuous coverage
  • Added score and bic methods
  • Improved vocabulary consistency in methods

v0.1.3 (12 Nov. 2018)

  • Initial release.

API reference

This page provides an auto-generated summary of pyXpcm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

Top-level PCM functions

Creating a PCM
pcm(K, features[, scaling, reduction, …]) Profile Classification Model class constructor
pyxpcm.load_netcdf(ncfile) Load a PCM model from netcdf file
Attributes
pcm.K Return the number of classes
pcm.F Return the number of features
pcm.features Return features definition dictionary
Computation
pcm.fit(self, ds[, features, dim]) Estimate PCM parameters
pcm.fit_predict(self, ds[, features, dim, …]) Estimate PCM parameters and predict classes.
pcm.predict(self, ds[, features, dim, …]) Predict labels for profile samples
pcm.predict_proba(self, ds[, features, dim, …]) Predict posterior probability of each components given the data
pcm.score(self, ds[, features, dim]) Compute the per-sample average log-likelihood of the given data
pcm.bic(self, ds[, features, dim]) Compute Bayesian information criterion for the current model on the input dataset
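Since pcm.bic returns a single number ("the lower the better"), a common pattern is to sweep the number of classes and compare scores. A minimal sketch using the tutorial dataset:

import numpy as np
import pyxpcm
from pyxpcm.models import pcm

ds = pyxpcm.tutorial.open_dataset('argo').load()
z = np.arange(0., -1000., -10.)
features_pcm = {'temperature': z, 'salinity': z}
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}

# Compare the BIC for several numbers of classes:
for K in [2, 4, 6, 8]:
    m = pcm(K=K, features=features_pcm)
    m.fit(ds, features=features_in_ds)
    print(K, m.bic(ds, features=features_in_ds))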

Low-level PCM properties and functions

pcm.timeit Return a pandas.DataFrame with the execution time of methods called on this instance
pcm.ravel(self, da[, dim, feature_name]) Extract from N-d array a X(feature,sample) 2-d array and vertical dimension z
pcm.unravel(self, ds, sampling_dims, X) Create a DataArray from a numpy array and sampling dimensions

Plotting

pcm.plot Access plotting functions
Plot PCM Contents
plot.quantile(m, da[, xlim, classdimname, …]) Plot q-th quantiles of a dataArray for each PCM components
plot.scaler(m[, style, plot_kw, subplot_kw]) Plot PCM scalers properties
plot.reducer(m[, pcalist, style, maxcols, …]) Plot PCM reducers properties
plot.preprocessed(m, ds[, features, dim, n, …]) Plot preprocessed features as pairwise scatter plots
plot.timeit(m[, group, split, subplot_kw, …]) Plot PCM registered timing of operations
Tools
plot.cmap(m, name[, palette, usage]) Return categorical colormaps
plot.colorbar(m[, cmap]) Add a colorbar to the current plot with centered ticks on discrete colors
plot.subplots(m[, maxcols, K, subplot_kw]) Return (figure, axis) with one subplot per cluster
plot.latlongrid(ax[, dx, dy, fontsize]) Add latitude/longitude grid line and labels to a cartopy geoaxes

Statistics

pcm.stat Access statistics functions
stat.quantile(ds[, q, of, using, outname, …]) Compute q-th quantile of a xarray.DataArray for each PCM components
stat.robustness(ds[, name, classdimname, …]) Compute classification robustness
stat.robustness_digit(ds[, name, …]) Digitize classification robustness
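For instance, following the accessor signatures below, the robustness of a classification (computed from the posteriors) could be attached to a dataset like this (a sketch; it requires the PCM_POST variable, i.e. a prior predict_proba call):

ds = ds.pyxpcm.robustness(m, inplace=True)        # adds a 'PCM_ROBUSTNESS' variable
ds = ds.pyxpcm.robustness_digit(m, inplace=True)  # adds a 'PCM_ROBUSTNESS_CAT' variable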

Save/load PCM models

pcm.to_netcdf(self, ncfile, **ka) Save a PCM to a netcdf file
pyxpcm.load_netcdf(ncfile) Load a PCM model from netcdf file

Helper

tutorial.open_dataset(name) Open a dataset from the pyXpcm online data repository (requires internet).

Xarray pyxpcm name space

Provide accessor to enhance interoperability between xarray and pyxpcm.

Provide a scope named pyxpcm as accessor to xarray.Dataset objects.

class pyxpcm.xarray.pyXpcmDataSetAccessor[source]

Class registered under scope pyxpcm to access xarray.Dataset objects.
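In practice, this means that PCM methods can be called from the dataset side, passing the PCM instance as the first argument. Given a PCM instance m (as created in the User guide):

ds = ds.pyxpcm.fit_predict(m, features={'temperature': 'TEMP', 'salinity': 'PSAL'}, inplace=True)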

add(self, da)[source]

Add a xarray.DataArray to this xarray.Dataset

bic(self, this_pcm, **kwargs)[source]

Compute Bayesian information criterion for the current model on the input dataset

Only for a GMM classifier

Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

Returns:
bic: float

The lower the better

drop_all(self)[source]

Remove xarray.DataArray variables created with pyXpcm from this xarray.Dataset

feature_dict(self, this_pcm, features=None)[source]

Return dictionary of features for this xarray.Dataset and a PCM

Parameters:
pcm : pyxpcm.pcmmodel.pcm
features : dict

Keys are PCM feature names, values are the corresponding xarray.Dataset variable names

Returns:
dict()

Dictionary where keys are PCM feature names and values the corresponding xarray.Dataset variables

fit(self, this_pcm, **kwargs)[source]

Estimate PCM parameters

For a PCM, the fit method consists in the following operations:

  • pre-processing
    • interpolation to the feature_axis levels of the model
    • scaling
    • reduction
  • estimate classifier parameters
Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

Returns:
self
fit_predict(self, this_pcm, **kwargs)[source]

Estimate PCM parameters and predict classes.

This method adds these properties to the PCM object:

  • llh: The log likelihood of the model with regard to new data
Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

inplace: boolean, False by default

If False, return a xarray.DataArray with predicted labels. If True, return the input xarray.Dataset with labels added as a new xarray.DataArray.

name: string (‘PCM_LABELS’)

Name of the DataArray holding labels.

Returns:
xarray.DataArray

Component labels (if option ‘inplace’ = False)

or
xarray.Dataset

Input dataset with component labels as a ‘PCM_LABELS’ new xarray.DataArray (if option ‘inplace’ = True)

mask(self, this_pcm, features=None, dim=None)[source]

Create a mask where all PCM features are defined

Create a mask where all feature profiles are not null over the this_pcm feature axis.

Parameters:
this_pcm : pyxpcm.pcmmodel.pcm
features : dict()

Definitions of this_pcm features in the xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim : str

Name of the vertical dimension in the xarray.Dataset. If not specified or set to None, dim is identified as the xarray.DataArray variables with attributes ‘axis’ set to ‘z’.

Returns:
xarray.DataArray
predict(self, this_pcm, inplace=False, **kwargs)[source]

Predict labels for profile samples

This method adds these properties to the PCM object:

  • llh: The log likelihood of the model with regard to new data
Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

inplace: boolean, False by default

If False, return a xarray.DataArray with predicted labels. If True, return the input xarray.Dataset with labels added as a new xarray.DataArray.

name: str, default is ‘PCM_LABELS’

Name of the xarray.DataArray with labels

Returns:
xarray.DataArray

Component labels (if option ‘inplace’ = False)

or
xarray.Dataset

Input dataset with Component labels as a ‘PCM_LABELS’ new xarray.DataArray (if option ‘inplace’ = True)

predict_proba(self, this_pcm, **kwargs)[source]

Predict posterior probability of each components given the data

This method adds these properties to the PCM instance:

  • llh: The log likelihood of the model with regard to new data
Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

inplace: boolean, False by default

If False, return a xarray.DataArray with predicted probabilities. If True, return the input xarray.Dataset with probabilities added as a new xarray.DataArray.

name: str, default is ‘PCM_POST’

Name of the DataArray with prediction probability (posteriors)

classdimname: str, default is ‘pcm_class’

Name of the dimension holding classes

Returns:
xarray.DataArray

Probability of each Gaussian (state) in the model given each sample (if option ‘inplace’ = False)

or
xarray.Dataset

Input dataset with Component Probability as a ‘PCM_POST’ new xarray.DataArray (if option ‘inplace’ = True)

quantile(self, this_pcm, inplace=False, **kwargs)[source]

Compute q-th quantile of a xarray.DataArray for each PCM components

Parameters:
q: float in the range of [0,1] (or sequence of floats)

Quantiles to compute, which must be between 0 and 1 inclusive.

of: str

Name of the xarray.Dataset variable to compute quantiles for.

using: str

Name of the xarray.Dataset variable with the classification labels to use. ‘PCM_LABELS’ is used by default.

outname: ‘PCM_QUANT’ or str

Name of the xarray.DataArray with quantile

keep_attrs: boolean, False by default

Whether or not to preserve the xarray.Dataset attributes in the new quantile variable.

Returns:
xarray.Dataset with shape (K, n_quantiles, N_z=n_features)
or
xarray.DataArray with shape (K, n_quantiles, N_z=n_features)
robustness(self, this_pcm, inplace=False, **kwargs)[source]

Compute classification robustness

Parameters:
name: str, default is ‘PCM_POST’

Name of the xarray.DataArray with prediction probability (posteriors)

classdimname: str, default is ‘pcm_class’

Name of the dimension holding classes

outname: ‘PCM_ROBUSTNESS’ or str

Name of the xarray.DataArray with robustness

inplace: boolean, False by default

If False, return a xarray.DataArray with robustness. If True, return the input xarray.Dataset with robustness added as a new xarray.DataArray.

Returns:
xarray.Dataset if inplace=True
or
xarray.DataArray if inplace=False
robustness_digit(self, this_pcm, inplace=False, **kwargs)[source]

Digitize classification robustness

Parameters:
ds: xarray.Dataset

Input dataset

name: str, default is ‘PCM_POST’

Name of the xarray.DataArray with prediction probability (posteriors)

classdimname: str, default is ‘pcm_class’

Name of the dimension holding classes

outname: ‘PCM_ROBUSTNESS_CAT’ or str

Name of the xarray.DataArray with robustness categories

inplace: boolean, False by default

If False, return a xarray.DataArray with robustness categories. If True, return the input xarray.Dataset with robustness categories added as a new xarray.DataArray.

Returns:
xarray.Dataset if inplace=True
or
xarray.DataArray if inplace=False
sampling_dim(self, this_pcm, features=None, dim=None)[source]

Return the list of dimensions to be stacked for sampling

Parameters:
pcm : pyxpcm.pcm
features : None (default) or dict()

Keys are PCM feature names, values are the corresponding xarray.Dataset variable names. If set to None, all PCM features are used.

dim : None (default) or str()

The xarray.Dataset dimension to use as vertical axis in all features. If set to None, it is automatically set to the dimension with an attribute axis set to Z.

Returns:
dict()

Dictionary where keys are the xarray.Dataset variable names of features, and values are another dictionary with the list of sampling dimensions under the DIM_SAMPLING key and the name of the vertical axis under the DIM_VERTICAL key.
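For instance, with the tutorial Argo dataset, the returned dictionary could look like this (hypothetical output, for illustration only):

sdims = ds.pyxpcm.sampling_dim(m)
# e.g. {'TEMP': {'DIM_SAMPLING': ['N_PROF'], 'DIM_VERTICAL': 'DEPTH'}, ...}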

score(self, this_pcm, **kwargs)[source]

Compute the per-sample average log-likelihood of the given data

Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

Returns:
log_likelihood: float

In the case of a GMM classifier, this is the Log likelihood of the Gaussian mixture given data

split(self)[source]

Split pyXpcm variables from the original xarray.Dataset

Returns:
xarray.Dataset, xarray.Dataset

Two datasets: one with the pyXpcm variables, one with the original variables.