NASA SBIR Phase I - Open Source Parallel Image Analysis and Machine Learning Pipeline¶
Getting Started
- About Ensemble Learning Models
- Install ELM
- Use Cases
- elm Intro
- elm Overview / Example
- Examples
About Ensemble Learning Models¶
Ensemble Learning Models (elm) is a set of tools for parallel ensemble unsupervised and supervised machine learning, with a focus on data structures useful in climate science and satellite imagery.
elm Capabilities¶
- Ensemble learning
- Large scale prediction
- Genetic algorithms
- Common preprocessing operations for satellite imagery and climate data
These capabilities are best shown in the Examples.
elm wraps together the following Python packages:
- dask-distributed: elm uses dask-distributed for parallelism over ensemble fitting and prediction
- scikit-learn: elm can use unsupervised and supervised models, preprocessors, scoring functions, and postprocessors from scikit-learn, or any estimator that follows the scikit-learn initialize / fit / predict estimator interface
- xarray: elm wraps xarray data structures for n-dimensional arrays, such as 3-dimensional weather cubes, and for collections of 2-D rasters, such as a LANDSAT sample
elm is a Work in Progress¶
elm is immature and largely for experimental use.
The developers do not promise backwards compatibility with future versions.
Next steps¶
Install ELM¶
You can install elm with conda or by installing from source.
Install from Conda¶
Mac-OSX and Linux: To install the latest release of elm on Mac-OSX or Linux:
conda create --name elm-env -c elm -c conda-forge elm python=3.5
source activate elm-env
Windows: To install the latest release of elm on Windows:
conda create --name elm-env -c elm -c conda-forge elm python=3.4
activate elm-env
If you have any trouble, adjust python=3.4 to python=3.5 (or vice versa), as package availability may change in the future.
This installs elm and all common dependencies. The channel arguments (-c elm -c conda-forge) are typically required.
Install from Source¶
To install elm from source, clone the repository from github:
git clone https://github.com/ContinuumIO/elm.git
cd elm
conda env create
source activate elm-env
python setup.py develop
Clone the elm-data repo and pull using Git Large File Storage (LFS) so that more tests can be run:
brew install git-lfs # or apt-get, yum, etc
git lfs install
git clone https://github.com/ContinuumIO/elm-data
git remote add origin https://github.com/ContinuumIO/elm-data
Add the following to your .bashrc or environment, changing the path depending on where you have cloned elm-data:
export ELM_EXAMPLE_DATA_PATH=/Users/peter/Documents/elm-data
Then do the tutorials and examples described in the sections below.
Use Cases¶
elm (Ensemble Learning Models) is a versatile set of tools for ensemble and evolutionary algorithm approaches to training and selecting machine learning models and large scale prediction from trained models. elm has a focus on data structures that are common in satellite and weather data analysis, such as rasters representing bands of satellite data or cubes of weather model output.
Common computational challenges in satellite and weather data machine learning include large samples, varied file formats, and model parameter uncertainty. To address these challenges, elm draws from existing Python packages such as dask-distributed, scikit-learn, and xarray (described above).
Large-Scale Model Training¶
elm offers the following strategies for large scale training:
- Use of partial_fit for incremental training on a series of samples
- Ensemble modeling: training batches of models in generations in parallel, with model selection after each generation
- Use of a Pipeline with a sequence of transformation steps
- partial_fit for incremental training of transformers used in Pipeline steps, such as PCA
- Custom user-given model selection logic in ensemble approaches to training
elm can use dask to parallelize the activities above.
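As a rough sketch of how these strategies combine (based on the Pipeline, fit_ensemble, and client_context usage shown later in these docs; the synthetic random_elm_store sample and the parameter values are only illustrative):
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

from elm.config import client_context
from elm.pipeline import Pipeline, steps
from elm.pipeline.tests.util import random_elm_store

# A partial_fit-capable transformer (IncrementalPCA) and estimator in one Pipeline
pipe = Pipeline([steps.Flatten(),
                 ('pca', steps.Transform(IncrementalPCA(n_components=3),
                                         partial_fit_batches=2)),
                 ('kmeans', MiniBatchKMeans(n_clusters=4))])

X = random_elm_store()   # synthetic sample for illustration
with client_context() as client:   # dask-distributed, thread pool, or serial
    pipe.fit_ensemble(X=X, client=client,
                      init_ensemble_size=4, ngen=2, partial_fit_batches=2)
    preds = pipe.predict_many(X=X, client=client)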
More reading:
Model Uncertainty¶
Ensemble modeling can be used to account for uncertainty that arises from uncertain model parameters or uncertainty in the fitting process. The ensemble approach in elm allows training and prediction from an ensemble where model parameters are varied, including parameters related to preprocessing transformations, such as feature selection or PCA transforms. See the predict_many example.
Hyperparameterization / Model Selection¶
elm offers two different algorithms for multi-model training with model selection:
- fit_ensemble: Running one batch of models at a time (a generation), with a user-given model selection function run after each generation
- fit_ea: Using the NSGA-2 evolutionary algorithm to select the best-fitting parameters
In either of these algorithms elm can use most of the model scoring features of scikit-learn, or a user-given model scoring callable.
See also:
- fit_ensemble
- fit_ea
- elm.model_selection in the API docs
- scikit-learn scoring classes that work with elm
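A hedged sketch of the fit_ensemble form of multi-model training with a model selection callable, following the ensemble_kwargs configuration shown in the Fit Ensemble section later in these docs (pipe and X stand for a Pipeline instance and a sample as defined there):
from elm.model_selection.kmeans import kmeans_model_averaging

ensemble_kwargs = {
    'model_selection': kmeans_model_averaging,   # run on the ensemble after each generation
    'model_selection_kwargs': {'drop_n': 2, 'evolve_n': 2},
    'init_ensemble_size': 4,
    'ngen': 3,
    'saved_ensemble_size': 4,
}
pipe.fit_ensemble(X=X, **ensemble_kwargs)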
Data/Metadata Formats¶
One challenge in satellite and weather data processing is the variety of input data formats, including GeoTiff, NetCDF, HDF4, HDF5, and others. elm offers a function load_array which can load spatial array data in the following formats:
- GeoTiff: Loads files from a directory of GeoTiffs, assuming each is a single-band raster
- NetCDF: Loads variables from a NetCDF file
- HDF4 / HDF5: Loads subdatasets from HDF4 and HDF5 files
load_array creates an ElmStore (read more here), a fundamental data structure in elm that is essentially an xarray.Dataset with metadata standardization over the various file types.
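A minimal sketch of load_array (the file and directory names below are placeholders; see Creating an ElmStore from File for full examples):
from elm.readers import load_array

es_nc = load_array('some_file.nc')       # NetCDF / HDF4 / HDF5: pass a single file
es_tif = load_array('landsat_tif_dir')   # GeoTiff: pass a directory of 1-band rasters
print(es_nc.data_vars)                   # each band becomes a DataArray in the ElmStore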
Preprocessing Input Data¶
elm has a wide range of support for preprocessing activities. One important feature of elm is its ability to train and/or predict from more than one sample and, for each sample, run a series of preprocessing steps that may include:
- Scaling, adding polynomial features, or other preprocessors from sklearn.preprocessing
- Feature selection using any class from sklearn.feature_selection
- Flattening collections of rasters to a single 2-D matrix for fitting / prediction
- Running user-given sample transformers
- Resampling one raster onto another raster's coordinates
- In-polygon selection
- Feature extraction through transform models like PCA or ICA
See elm.pipeline.steps for more information on preprocessing.
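For example, a hedged sketch of a Pipeline combining several of the preprocessing steps above (step choices and parameters are illustrative; see elm.pipeline.steps for the full list):
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

from elm.pipeline import Pipeline, steps

pipe = Pipeline([steps.Flatten(),                                     # rasters -> single 2-D matrix
                 ('scaler', steps.StandardScaler()),                  # sklearn.preprocessing wrapper
                 ('select', steps.VarianceThreshold(threshold=0.1)),  # feature selection
                 ('pca', steps.Transform(PCA(n_components=3))),       # feature extraction
                 ('est', MiniBatchKMeans(n_clusters=4))])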
Predicting for Many Large Samples and/or Models¶
elm can use dask-distributed, a dask thread pool, or serial processing for predicting over a group (ensemble) of models and a single sample or series of samples. elm's interface for large scale prediction, described here, is via the predict_many method of a Pipeline instance.
elm is a Work in Progress¶
elm is immature and largely for experimental use.
The developers do not promise backwards compatibility with future versions.
elm Intro¶
This tutorial is a Hello World example with elm.
Step 1 - Choose Model(s)¶
First import model(s) from scikit-learn, and import Pipeline and steps from elm.pipeline:
from elm.config import client_context
from elm.pipeline.tests.util import random_elm_store
from elm.pipeline import Pipeline, steps
from sklearn.decomposition import PCA
from sklearn.cluster import AffinityPropagation
random_elm_store is a function that returns random rasters (xarray.DataArrays) in an ElmStore, a data structure similar to an xarray.Dataset. steps is a module of all the transformation steps possible in a Pipeline.
See the LANDSAT K-Means and other examples to see how to read an ElmStore from GeoTiff, HDF4, HDF5, or NetCDF.
Step 2 - Define a sampler¶
If fitting more than one sample, then define a sampler function to pass to fit_ensemble. Here we are using a partial of random_elm_store (synthetic data). If using a sampler, we also need to define args_list, a list of tuples where each tuple can be unpacked as arguments to sampler. The length of args_list determines the number of samples potentially used. Here we have 2 empty tuples as args_list because our sampler needs no arguments and we want 2 samples. Alternatively, the arguments X, y, and sample_weight may be given in place of sampler and args_list.
from functools import partial
N_SAMPLES = 2
bands = ['band_{}'.format(idx + 1) for idx in range(10)]
sampler = partial(random_elm_store,
bands=bands,
width=60,
height=60)
args_list = [(),] * N_SAMPLES
Step 3 - Define a Pipeline¶
The code block below uses Flatten to convert each 2-D raster (DataArray) to a single 1-D column of a 2-D DataArray for machine learning. The output of Flatten is in turn passed to sklearn.decomposition.PCA, and the reduced feature set from PCA is passed to the sklearn.cluster.AffinityPropagation clustering model.
pipe = Pipeline([('flat', steps.Flatten()),
('pca', steps.Transform(PCA())),
('aff_prop', AffinityPropagation())])
Step 4 - Call fit_ensemble with dask¶
Now we can use fit_ensemble to fit to one or more samples and one or more instances of the pipe Pipeline above. Below we are passing sampler and args_list, plus client, which will be a dask-distributed client, a ThreadPool, or None, depending on environment variables. init_ensemble_size sets the number of Pipeline instances, and models_share_sample=False means to fit all Pipeline / sample combinations (2 x 2 == 4 total members in this case).
with client_context() as client:
pipe.fit_ensemble(sampler=sampler,
args_list=args_list,
client=client,
init_ensemble_size=2,
models_share_sample=False,
ngen=1)
The code block with fit_ensemble above would show the repr of the Pipeline object as follows:
<elm.pipeline.Pipeline> with steps:
flat: <elm.steps.Flatten>:
pca: <elm.steps.Transform>:
copy: True
iterated_power: 'auto'
n_components: None
partial_fit_batches: None
random_state: None
svd_solver: 'auto'
tol: 0.0
whiten: False
aff_prop: AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
damping=0.5, max_iter=200, preference=None, verbose=False)
We can confirm that we have 4 Pipeline instances in the trained ensemble:
>>> len(pipe.ensemble)
4
Step 5 - Call predict_many¶
predict_many will by default predict from the ensemble that was just trained (4 models in this case). predict_many takes sampler and args_list like fit_ensemble. The args_list may differ from that given to fit_ensemble or be the same. We have 4 trained models in the .ensemble attribute of pipe and 2 samples specified by args_list, so predict_many returns a list of 8 prediction ElmStores.
import matplotlib.pyplot as plt
with client_context() as client:
preds = pipe.predict_many(sampler=sampler, args_list=args_list, client=client)
example = preds[0]
example.predict.plot.pcolormesh()
plt.show()
Read More : LANDSAT K-Means example
elm Overview / Example¶
This page walks through a Jupyter notebook using elm to ensemble fit K-Means and predict from all members of the ensemble.
It demonstrates the common steps of using elm:
- Working with elm.readers.load_array to read NetCDF, HDF4, HDF5, and GeoTiff files, and controlling how a sample is composed of bands or separate rasters with BandSpec. See also Creating an ElmStore from File.
- Defining a Pipeline of transformers (e.g. normalization and PCA) and an estimator, where the transformers use classes from elm.pipeline.steps and the estimator is a model with a fit / predict interface. See also Pipeline.
- Calling fit_ensemble to train the Pipeline under varying parameters with one or more input samples
- Calling predict_many to predict from all trained ensemble members for one or more input samples
LANDSAT¶
The LANDSAT classification notebook is from elm-examples. This section walks through that notebook, pointing out:
- How to use elm.readers for scientific data files like GeoTiffs
- How to set up an elm.pipeline.Pipeline of transformations
- How to use dask to fit a Pipeline in ensemble and predict from many models
NOTE: To follow along, make sure you follow the Prerequisites for Examples. The LANDSAT sample used here can be found in the AWS S3 LANDSAT bucket.
elm.readers Walk-Through¶
First the notebook sets some environment variables related to the usage of a dask-distributed Client and the path to the GeoTiff example files from elm-data:
Each GeoTiff file has 1 raster (band of LANDSAT data):
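The setup cell is not reproduced here; a hedged sketch (the 'tif' subdirectory name under elm-data is an assumption):
import glob
import os

ELM_EXAMPLE_DATA_PATH = os.environ['ELM_EXAMPLE_DATA_PATH']   # path to the elm-data clone
tif_dir = os.path.join(ELM_EXAMPLE_DATA_PATH, 'tif')          # assumed subdirectory name
print(sorted(glob.glob(os.path.join(tif_dir, '*.TIF'))))      # one 1-band GeoTiff per band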

See more information on ElmStore in the ElmStore section.
elm.readers.BandSpec¶
Using a list of BandSpec objects, as shown below, is how one can control which bands, or individual GeoTiff files, become the sample dimensions for learning:
- buf_xsize: The size of the output raster horizontal dimension
- buf_ysize: The size of the output raster vertical dimension
- name: What to call the band in the ElmStore returned. For example, band_1 as a name will mean you can use X.band_1 and find band_1 as a key in X.data_vars.
- search_key: Where to look for the band identifying info, in this case the file name
- search_value: What string token identifies a band, e.g. B1.TIF (see file names printed above)
We are using buf_xsize and buf_ysize below to downsample.
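A hedged sketch of such a list of BandSpec objects for the LANDSAT GeoTiffs (using search_key='name' to match on file names is an assumption; check the BandSpec repr below for the exact field names):
from elm.readers import BandSpec

band_specs = [BandSpec(search_key='name',                  # assumed: match on file name
                       search_value='B{}.TIF'.format(idx),
                       name='band_{}'.format(idx),
                       buf_xsize=800, buf_ysize=800)       # downsampled raster size
              for idx in range(1, 8)]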

Check the repr of the BandSpec objects to see all possible arguments controlling the reading of bands:

elm.readers.load_array¶
load_array aims to find a file reader for a NetCDF, HDF4, HDF5, or GeoTiff source.
When reading GeoTiff files, the first argument to load_array is a directory, and it is assumed that the directory contains GeoTiff files, each with a 1-band raster.
For NetCDF, HDF4, and HDF5, the first argument is a single filename, and the bands are taken from the variables (NetCDF) or subdatasets (HDF4 / HDF5).
band_specs (the list of BandSpec objects from above) is passed to load_array to control which bands are read from the directory of GeoTiffs.
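A hedged minimal call using the tif_dir and band_specs from the sketches above:
from elm.readers import load_array

X = load_array(tif_dir, band_specs=band_specs)
print(X.data_vars)     # band_1 ... band_7, named by the BandSpec objects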

Visualization with ElmStore¶
The notebook then goes through a number of examples similar to:
X.band_1.plot.pcolormesh()
- The code uses names like band_1, band_2. These are named DataArray objects in the ElmStore X because of the name argument to the BandSpec objects above. The plot.pcolormesh() method comes from the data viz tools of xarray.DataArray.
- The output of X.band_1.plot.pcolormesh():

Building a Pipeline¶
Building an elm.pipeline.Pipeline of transformations is similar to the idea of a Pipeline in scikit-learn.
- All steps but the last step in a Pipeline must be instances of classes from elm.pipeline.steps; these are the transformers.
- The final step in a Pipeline should be an estimator from scikit-learn with a fit/predict interface.
- The notebook shows how to specify a several-step Pipeline of:
- Flattening rasters
- Assigning NaN where needed
- Dropping NaN rows
- Standardizing (Z-scoring) by band means and standard deviations
- Adding polynomial interaction terms of degree two
- Transforming with PCA
- K-Means with partial_fit several times per model
Preamble - Imports
This cell shows typical import statements for working with elm.pipeline.steps classes that become part of a Pipeline, including importing a transformer and estimator from scikit-learn:

Steps - Flatten¶
This transform-flatten step is essentially .ravel on each DataArray in X to create a single 2-D DataArray:

Steps - ModifySample - set_nans¶
The next step uses elm.pipeline.steps.ModifySample to run a custom callable in a Pipeline of transformations. This function sets NaN for the no-data perimeters of the rasters:
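The notebook's set_nans function is not reproduced here; a hedged sketch with the (X, y, sample_weight) signature that ModifySample expects (the zero-equals-no-data rule and float-valued bands are assumptions for this sketch):
import numpy as np

from elm.pipeline import steps

def set_nans(X, y=None, sample_weight=None, **kwargs):
    # Assumed no-data rule: zero-valued pixels around the scene perimeter become NaN.
    for band in X.data_vars:
        arr = getattr(X, band)
        arr.values[arr.values == 0] = np.NaN   # assumes float-valued bands
    return X, y, sample_weight

set_nans_step = steps.ModifySample(set_nans)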

Steps - DropNaRows - Drop Null / NaN Rows¶
The transform-dropnarows step is a transformer to remove the NaN values from the DataArray flat (the flattened (ravel) rasters as a single 2-D DataArray):

Steps - ModifySample - Log Transform (or pass through)¶
This usage of ModifySample allows the Pipeline to use log transformation or not (see the usage of set_params several screenshots later):

Feature engineering in a Pipeline¶
Define a function that can compute normalized differences between bands (rasters, or DataArrays), adding the normalized differences to what will be the X data in the Pipeline of transformations.

Feature engineering - ModifySample with arguments¶
And here is how the function above can be used in a Pipeline (wrapping it with elm.pipeline.steps.ModifySample):
We are calculating:
- NDWI: Normalized Difference Water Index, (band_4 - band_5) / (band_4 + band_5)
- NDVI: Normalized Difference Vegetation Index, (band_5 - band_4) / (band_5 + band_4)
- NDSI: Normalized Difference Snow Index, (band_2 - band_6) / (band_2 + band_6)
- NBR: Normalized Burn Ratio, (band_4 - band_7) / (band_7 + band_4)
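A hedged sketch of such a step (band names follow the BandSpec names above; item assignment on an ElmStore is assumed to behave like xarray.Dataset, and the notebook's actual function may differ):
from elm.pipeline import steps

def add_normalized_diffs(X, y=None, sample_weight=None, **kwargs):
    # Add the normalized differences listed above as new DataArrays on X.
    X['ndwi'] = (X.band_4 - X.band_5) / (X.band_4 + X.band_5)
    X['ndvi'] = (X.band_5 - X.band_4) / (X.band_5 + X.band_4)
    X['ndsi'] = (X.band_2 - X.band_6) / (X.band_2 + X.band_6)
    X['nbr'] = (X.band_4 - X.band_7) / (X.band_7 + X.band_4)
    return X, y, sample_weight

normed_diffs_step = steps.ModifySample(add_normalized_diffs)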

Using pcolormesh on normalized differences of bands
Here are the NDWI and NDVI plotted with xarray's pcolormesh method:

False Color with normalized differences of bands
The image below has an RGB (red, green, blue) matrix made up of the NBR, NDSI, and NDWI normalized differences:

Normalization and Adding Polynomial Terms¶
The following snippets show how to use a class from sklearn.preprocessing or sklearn.feature_selection with Pipeline:


Custom Feature Selection
By defining the function below, we will be able to choose among random combinations of the original data or normalized differences:

PCA¶
Use steps.Transform to wrap PCA or another method from sklearn.decomposition for elm.pipeline.Pipeline.

Use an estimator from scikit-learn¶
Use a model with a fit / predict interface, such as KMeans.

Most scikit-learn models described here are supported.
Create Pipeline instance¶
The following uses all the steps we have created, in a sequence of tuples, and configures scoring for K-Means with the Akaike Information Criterion (AIC).
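A hedged sketch of such a Pipeline (step ordering, names, and parameters are illustrative; set_nans and add_normalized_diffs refer to the sketch functions above, and kmeans_aic provides the AIC scoring as in the Fit Ensemble example later in these docs):
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

from elm.model_selection.kmeans import kmeans_aic
from elm.pipeline import Pipeline, steps

pipe = Pipeline([steps.ModifySample(set_nans),
                 steps.ModifySample(add_normalized_diffs),
                 steps.Flatten(),
                 steps.DropNaRows(),
                 ('scaler', steps.StandardScaler()),
                 ('poly', steps.PolynomialFeatures(degree=2, interaction_only=True)),
                 ('pca', steps.Transform(PCA(n_components=4))),
                 ('kmeans', MiniBatchKMeans(n_clusters=5))],
                scoring=kmeans_aic,
                scoring_kwargs=dict(score_weights=[-1]))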

The next steps deal with controlling fit_ensemble (fitting with a group of models with different parameters).
See more info on Pipeline here.
ensemble_init_func¶
This is an example ensemble_init_func to pass to fit_ensemble, using pipe.new_with_params(**new_params) to create a new unfitted Pipeline instance with new parameters.

The fit_ensemble docs also show an example of an ensemble_init_func.
More fit_ensemble control¶
The following sets the number of generations (ngen) and the model_selection callable run after each generation.

Parallelism with dask-distributed¶
fit_ensemble (which fits a group of models in generations, with model selection after each generation) and predict_many each take a client keyword, a dask Client. predict_many parallelizes over multiple models and samples, though here only one sample is used.
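A hedged sketch of the fitting and prediction cells (pipe, X, and ensemble_kwargs stand for the Pipeline, sample, and ensemble configuration built up above):
from elm.config import client_context

with client_context() as client:     # dask-distributed if a scheduler is configured
    pipe.fit_ensemble(X=X, client=client, **ensemble_kwargs)
    preds = pipe.predict_many(X=X, client=client)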


Using an ElmStore from predict_many¶
predict_many has called transform-inverseflatten to reshape the 1-D numpy array from the sklearn.cluster.KMeans.predict method into a 2-D raster with the coordinates of the original data. Note also that inverse_flatten typically preserves NaN regions of the original data (the NaN borders of this image are preserved).
Using xarray's pcolormesh on the predict attribute (DataArray) of an ElmStore returned by predict_many:

The best prediction in terms of AIC:

Examples¶
This page provides examples of Python sessions using elm code and yaml config files that can be run with elm-main.
Prerequisites for Examples¶
Follow the instructions for installation of elm and the elm-env conda environment (Install):
Also, define the environment variable ELM_EXAMPLE_DATA_PATH to be your full path to a local clone of elm-data.
Jupyter (IPython) Notebooks with elm¶
Two notebooks provide worked out examples of using elm with time series of spatial weather data from NetCDF files.
One of the notebooks also goes through how to build true and false color images with an ElmStore and matplotlib.pyplot.imshow.
Scripts using elm¶
- api_example.py : A K-Means example with PCA preprocessing in ensemble
- api_example_evo.py : A K-Means example with feature selection in NSGA-2
Notebooks using elm¶
- Clustering of temperature probability distributions in time
- Land cover clustering with K-Means, PCA, and other transformations
yaml config files for elm-main¶
Data Structures
ElmStore¶
ElmStore, from elm.readers, is a fundamental data structure in elm and is the data structure used to pass arrays and metadata through each of the steps in a Pipeline (series of transformations). An ElmStore is oriented around multi-band rasters and cubes stored in HDF4 / HDF5, NetCDF, or GeoTiff formats. ElmStore is a light wrapper around xarray.Dataset.
This page discusses:
See also API docs.
Creating an ElmStore from File¶
An ElmStore can be created from an HDF4, HDF5, or NetCDF file with load_array from elm.readers. The simple case is to load all bands or subdatasets from an HDF or NetCDF file:
from elm.readers import load_array
filename = '3B-HHR-E.MS.MRG.3IMERG.20160708-S153000-E155959.0930.V03E.HDF5.nc'
es = load_array(filename)
For GeoTiffs the argument is a directory name rather than a file name and each band is formed from individual GeoTiff files in the directory. The following is an example with LANDSAT GeoTiffs for bands 1 through 7:
In [1]: from elm.readers import BandSpec, load_array
In [2]: ls
LC80150332013207LGN00_B1.TIF LC80150332013207LGN00_B5.TIF
LC80150332013207LGN00_B2.TIF LC80150332013207LGN00_B6.TIF
LC80150332013207LGN00_B3.TIF LC80150332013207LGN00_B7.TIF
LC80150332013207LGN00_B4.TIF logfile.txt
In [3]: es = load_array('.')
In [4]: es.data_vars
Out[4]:
Data variables:
band_0 (y, x) uint16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
band_1 (y, x) uint16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
band_2 (y, x) uint16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
band_3 (y, x) uint16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
band_4 (y, x) uint16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
band_5 (y, x) uint16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
band_6 (y, x
The example above for GeoTiffs loaded the correct bands, but labeled them in a way that may be confusing downstream in the analysis. The following section shows how to control which bands are loaded and what they are named.
Controlling Which Bands Are Loaded¶
Use the band_specs keyword to load_array to:
- Control which subdatasets, or bands typically, are loaded into the ElmStore, and/or
- Standardize the names of the bands (DataArrays) in the ElmStore.
The band_specs work slightly differently for each file type:
- HDF4 / HDF5: The band_specs determine matching against one of the HDF4 file's subdatasets (see also GDAL subdatasets).
- NetCDF: The band_specs determine matching against one of the NetCDF file's variables metadata (NetCDF4 python variables interface).
- GeoTiff: When calling load_array for GeoTiffs, the argument is a directory (of GeoTiff files), not a single GeoTiff file. The band_specs for a GeoTiff file determine matching based on the GDAL metadata for each GeoTiff in the directory. GeoTiffs are read using rasterio, a wrapper around GDAL.
In simple cases band_specs can be a list of strings to match a NetCDF variable name, subdataset name, or GeoTiff file name, as shown below:
In [4]: from elm.readers import load_array
In [5]: filename = '3B-HHR-E.MS.MRG.3IMERG.20160708-S153000-E155959.0930.V03E.HDF5.nc'
In [6]: es = load_array(filename, band_specs=['HQobservationTime'])
In [7]: es.data_vars
Out[7]:
Data variables:
HQobservationTime (lon, lat) timedelta64[ns] NaT NaT NaT NaT NaT NaT ...
With GeoTiffs, giving a list of strings as band_specs finds matching GeoTiff files (bands) by testing whether each string is contained in a GeoTiff file name in the directory. Here is an example:
from elm.readers import load_array
dir_of_tifs = '.'
load_array(dir_of_tifs, band_specs=["B1.TIF", "B2.TIF","B3.TIF"])
band_specs can also be given as a list of elm.readers.BandSpec objects. The following shows an example of loading 4 bands from an HDF4 file where the band name, such as "Band 1 ", is found in the long_name key/value of the subdataset (band) metadata, and the band names are standardized to lower case with no spaces.
In [1]: from elm.readers import BandSpec, load_array
In [2]: band_specs = list(map(lambda x: BandSpec(**x),
[{'search_key': 'long_name', 'search_value': "Band 1 ", 'name': 'band_1'},
{'search_key': 'long_name', 'search_value': "Band 2 ", 'name': 'band_2'},
{'search_key': 'long_name', 'search_value': "Band 3 ", 'name': 'band_3'},
{'search_key': 'long_name', 'search_value': "Band 4 ", 'name': 'band_4'}]))
In [3]: filename = 'NPP_DSRF1KD_L2GD.A2015017.h09v05.C1_03001.2015018132754.hdf'
In [4]: es = load_array(filename, band_specs=band_specs)
In [5]: es.data_vars
Out[5]:
Data variables:
band_1 (y, x) uint16 877 877 767 659 920 935 935 918 957 989 989 789 ...
band_2 (y, x) uint16 899 899 770 659 954 973 973 935 994 1004 1004 841 ...
band_3 (y, x) uint16 1023 1023 880 781 1115 1141 1141 1082 1155 1154 ...
band_4 (y, x) uint16 1258 1258 1100 1009 1374 1423 1423 1341 1408 1405 ...
Note the BandSpec objects could have also used the keyword arguments key_re_flags and value_re_flags with a list of flags passed to re for regular expression matching.
BandSpec - File Reading Control¶
Here are a few more things a BandSpec can do:
- A BandSpec can control the resolution at which a file is read (and improve loading speed). To control resolution when loading rasters, provide buf_xsize and buf_ysize keyword arguments (integers) to BandSpec.
- A BandSpec can provide a window that subsets the file. See this rasterio demo that shows how window is effectively interpreted in load_array.
- A BandSpec with a meta_to_geotransform callable attribute can be used to construct a geo_transform array from band metadata (e.g. when GDAL fails to detect the geo_transform accurately).
- A BandSpec can control whether a raster is loaded with ("y", "x") pixel order (the default behavior that suits most top-left-corner based rasters) or ("x", "y") pixel order.
See also the definition of BandSpec in elm.readers showing all the recognized fields (snippet taken from elm.readers.util):
@attr.s
class BandSpec(object):
search_key = attr.ib()
search_value = attr.ib()
name = attr.ib()
key_re_flags = attr.ib(default=None)
value_re_flags = attr.ib(default=None)
buf_xsize = attr.ib(default=None)
buf_ysize = attr.ib(default=None)
window = attr.ib(default=None)
meta_to_geotransform = attr.ib(default=None)
stored_coords_order = attr.ib(default=('y', 'x'))
Creating an ElmStore - Constructor¶
Here is an example of creating an ElmStore from numpy arrays and xarray.DataArrays. In most ways, an ElmStore is interchangeable with an xarray.Dataset (see also the docs on working with a Dataset).
from collections import OrderedDict
import numpy as np
import xarray as xr
from elm.readers import ElmStore
rand_array = lambda: np.random.normal(0, 1, 1000000).reshape(-1,10)
def sampler(**kwargs):
bands = ['b1', 'b2', 'b3', 'b4']
es_data = OrderedDict()
for band in bands:
arr = rand_array()
y = np.arange(arr.shape[0])
x = np.arange(arr.shape[1])
es_data[band] = xr.DataArray(arr, coords=[('y', y), ('x', x)], dims=('y', 'x'), attrs={})
return ElmStore(es_data, add_canvas=False)
Calling sampler above gives:
<elm.ElmStore>
Dimensions: (x: 10, y: 100000)
Coordinates:
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* x (x) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
b1 (y, x) float64 1.772 -0.414 1.37 2.107 -1.306 0.9612 -0.0696 ...
b2 (y, x) float64 0.07442 1.908 0.5816 0.06838 -2.712 0.4544 ...
b3 (y, x) float64 -2.597 -1.893 0.05608 -0.5394 1.406 -0.6185 ...
b4 (y, x) float64 1.054 -1.522 -0.03202 -0.02127 0.02914 -0.6757 ...
Attributes:
_dummy_canvas: True
band_order: ['b1', 'b2', 'b3', 'b4']
ElmStore has the initialization keyword argument add_canvas that differs from xarray.Dataset. If add_canvas is True (the default), it is expected that the band metadata in the DataArrays each contain a geo_transform key with a value that is a sequence of length 6. See the GDAL data model for more information on geo transforms. In the example above each DataArray did not have a geo_transform in attrs, so add_canvas was set to False. The limitation of not having a canvas attribute is the inability to use some spatial reindexing transformations (e.g. elm.pipeline.steps.SelectCanvas, described further below).
Attributes of an ElmStore¶
If an ElmStore was initialized with add_canvas (the behavior in load_array), then each band, or DataArray, is expected to have a geo_transform in its metadata. The geo_transform information, in combination with the array dimensions and shape, creates the ElmStore's canvas attribute.
In [4]: es.canvas
Out[5]: Canvas(geo_transform=(-180.0, 0.1, 0, -90.0, 0, 0.1), buf_xsize=3600, buf_ysize=1800, dims=('lon', 'lat'), ravel_order='C', zbounds=None, tbounds=None, zsize=None, tsize=None, bounds=BoundingBox(left=-180.0, bottom=-90.0, right=179.90000000000003, top=89.9))
The canvas is used in the Pipeline for transformations like elm.pipeline.steps.SelectCanvas, which can be used to reindex all bands onto the coordinates of one of the bands in the ElmStore.
An ElmStore has a data_vars attribute (inherited from xarray.Dataset, described here), and also has an attribute band_order. When elm.pipeline.steps.Flatten flattens the separate bands of an ElmStore, band_order becomes the order of the bands in the single flattened 2-D array.
In [5]: filename = '3B-MO.MS.MRG.3IMERG.20160101-S000000-E235959.01.V03D.HDF5'
In [6]: es = load_array(filename)
In [7]: es.data_vars
Out[7]:
Data variables:
band_0 (y, x) int16 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 ...
band_1 (y, x) float32 -9999.9 -9999.9 -9999.9 -9999.9 -9999.9 -9999.9 ...
band_2 (y, x) int16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
band_3 (y, x) float32 -9999.9 -9999.9 -9999.9 -9999.9 -9999.9 -9999.9 ...
In [8]: es.band_order
Out[8]: ['band_0', 'band_1', 'band_2', 'band_3']
Common ElmStore Transformations¶
Flatten
elm.pipeline.steps.Flatten will convert an ElmStore of 2-D rasters in bands (each band as a DataArray) to an ElmStore with a single DataArray called flat. Note: elm.pipeline.steps.Flatten() must be included in a Pipeline before scikit-learn based transforms on an ElmStore, where the scikit-learn transforms expect a 2-D array.
Here is an example of Flatten that continues the example above that defined sampler, a function returning a random ElmStore of 2-D DataArray bands:
es = sampler()
X_2d, y, sample_weight = steps.Flatten().fit_transform(es)
In [17]: X_2d.flat
Out[17]:
<xarray.DataArray 'flat' (space: 1000000, band: 4)>
array([[ 1.13465339, -0.1533531 , 1.72809878, -0.7746218 ],
[-0.12378515, -1.72588715, 0.07752273, -1.19004227],
[ 2.16456385, -0.58083733, 0.03706811, 0.26274225],
...,
[ 0.45586256, -1.87248571, 1.27793313, 0.19892153],
[ 2.11702651, -0.05300853, -0.92923591, -1.07152977],
[-0.10245425, -1.27150399, -1.48745754, 1.00873062]])
Coordinates:
* space (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
* band (band) <U2 'b1' 'b2' 'b3' 'b4'
Attributes:
old_dims: [('y', 'x'), ('y', 'x'), ('y', 'x'), ('y', 'x')]
_dummy_canvas: True
canvas: Canvas(geo_transform=(-180, 0.1, 0, 90, 0, -0.1), buf_xsize=10, buf_ysize=100000, dims=('y', 'x'), ravel_order='C', zbounds=None, tbounds=None, zsize=None, tsize=None, bounds=BoundingBox(left=-180.0, bottom=90.0, right=-179.1, top=-9909.900000000001))
old_canvases: [Canvas(geo_transform=(-180, 0.1, 0, 90, 0, -0.1), buf_xsize=10, buf_ysize=100000, dims=('y', 'x'), ravel_order='C', zbounds=None, tbounds=None, zsize=None, tsize=None, bounds=BoundingBox(left=-180.0, bottom=90.0, right=-179.1, top=-9909.900000000001)), Canvas(geo_transform=(-180, 0.1, 0, 90, 0, -0.1), buf_xsize=10, buf_ysize=100000, dims=('y', 'x'), ravel_order='C', zbounds=None, tbounds=None, zsize=None, tsize=None, bounds=BoundingBox(left=-180.0, bottom=90.0, right=-179.1, top=-9909.900000...
flatten_data_array: True
band_order: ['b1', 'b2', 'b3', 'b4']
InverseFlatten
elm.pipeline.steps.InverseFlatten converts a flattened ElmStore (typically the output of transform-flatten above) back to separate 2-D raster bands.
es = sampler()
X_2d, y, sample_weight = steps.Flatten().fit_transform(es)
restored, _, _ = steps.InverseFlatten().fit_transform(X_2d)
np.all(restored.b1.values == es.b1.values)
DropNaRows
elm.pipeline.steps.DropNaRows is a transformer that will drop any null rows from an ElmStore that has a DataArray called flat (see transform-flatten). It drops the null rows while keeping metadata to allow transform-inverseflatten in predict_many. An example usage of DropNaRows is given in the K-Means LANDSAT elm introduction.
Here is an example of using DropNaRows with the sampler function defined above:
es = sampler()
X_2d, _, _ = steps.Flatten().fit_transform(es)
X_2d.flat.values[:2, :] = np.NaN
X_no_na, _, _ = steps.DropNaRows().fit_transform(X_2d)
assert X_no_na.flat.shape[0] == X_2d.flat.shape[0] - 2
restored, _, _ = steps.InverseFlatten().fit_transform(X_no_na)
assert restored.b1.shape == es.b1.shape
val = restored.b1.values
assert val[np.isnan(val)].size == 2
Agg
Aggregation along a dimension can be done with elm.pipeline.steps.Agg, referencing either a dim or an axis.
In [44]: es = sampler()
In [45]: agged, _, _ = steps.Agg(dim='y', func='median').fit_transform(es)
In [46]: agged
Out[46]:
ElmStore:
<elm.ElmStore>
Dimensions: (x: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
b1 (x) float64 -0.00231 -0.00294 -0.002797 0.002472 -0.006088 ...
b2 (x) float64 8.965e-06 0.0001929 -0.007133 0.001447 -0.001846 ...
b3 (x) float64 -0.0009686 -0.003632 -0.0007322 -0.002221 -0.0039 ...
b4 (x) float64 0.00667 0.001018 0.002702 0.009274 0.001481 ...
Attributes:
_dummy_canvas: True
band_order: ['b1', 'b2', 'b3', 'b4']
In the example above, median could have been replaced by any of the following: all, any, argmax, argmin, max, mean, median, min, prod, sum, std, var.
Read more on the implementation of the functions above in the xarray.DataArray methods here.
ElmStore and Metadata¶
This section describes elm functions useful for deriving information from file metadata.
set_na_from_meta: This function searches the attrs of each DataArray in an ElmStore or xarray.Dataset and sets NaN values in each DataArray where metadata indicates it is necessary. Currently set_na_from_meta searches attrs for the following keys using a case-, space-, and punctuation-insensitive regular expression:
- missing_value: Any values in the DataArray equal to the missing value will be set to NaN.
- valid_range and invalid_range: If attrs have a key like valid_range or invalid_range, the function will check whether it is a sequence of length 2, or a string that can be split on commas or spaces to form a sequence of length 2. If a sequence of length 2 is found, then the invalid / valid ranges are used to set NaN values appropriately.
from elm.readers.tests.util import HDF4_FILES
from elm.readers import load_array, set_na_from_meta
es = load_array(HDF4_FILES[0])
set_na_from_meta(es) # modifies ElmStore instance in place
meta_is_day: This function takes a single argument, a dict that is typically the attrs of an ElmStore, and searches for keys/values indicating whether the attrs correspond to a day or night sample.
from elm.readers.tests.util import HDF4_FILES
from elm.readers import load_array
from elm.sample_util.metadata_selection import example_meta_is_day
from scipy.stats import describe
es3 = load_array(HDF4_FILES[0])
es3.DayNightFlag # prints "Day"
example_meta_is_day(es3) # prints True
Pipeline¶
Overview of Pipeline in elm¶
elm.pipeline.Pipeline allows a sequence of transformations on samples before fitting, transforming, and/or predicting from a scikit-learn estimator. elm.pipeline.Pipeline is similar to the concept of the Pipeline in scikit-learn (sklearn.pipeline.Pipeline) but differs in several ways described below.
- Data sources for a Pipeline: In elm, fitting expects X to be an ElmStore or xarray.Dataset rather than a numpy array as in scikit-learn. This allows the Pipeline of transformations to include operations on cubes and other data structures common in satellite data machine learning.
- Transformations: In scikit-learn each step in a Pipeline passes a numpy array to the next step by way of a fit_transform method. In elm, a Pipeline always passes a tuple of (X, y, sample_weight), where X is an ElmStore or xarray.Dataset and y and sample_weight are numpy arrays or None.
- Partial Fit for Large Samples: In elm a transformer with a partial_fit method, such as sklearn.decomposition.IncrementalPCA, may be partially fit several times as a step in a Pipeline, and the final estimator may also use partial_fit several times, with dask-distributed for parallelization.
- Multi-Model / Multi-Sample Fitting: In elm, a Pipeline can be fit with:
  - fit_ensemble: This method repeats model fitting over a series of samples and/or an ensemble of Pipeline instances. The Pipeline instances in the ensemble may or may not have the same initialization parameters. fit_ensemble can run in generations, optionally applying user-given model selection logic between generations. This fit_ensemble method is aimed at improved model fitting in cases where a representative sample is large and/or there is a need to account for parameter uncertainty.
  - fit_ea: This method uses Distributed Evolutionary Algorithms in Python (deap) to run a genetic algorithm, typically NSGA-2, that selects the best Pipeline instance(s). The interfaces for fit_ea and fit_ensemble are similar, but fit_ea takes an evo_params argument to configure the genetic algorithm.
- Multi-Model / Multi-Sample Prediction: elm's Pipeline has a method predict_many that can use dask-distributed to predict from one or more Pipeline instances and/or one or more samples (ElmStores). predict_many will predict for all models in the final ensemble output by fit_ensemble.
The following discusses each step of making a Pipeline that uses most of the features described above.
Data Sources for a Pipeline¶
A Pipeline can be used for fitting / transforming / predicting from a single sample or a series of samples. For the fit_ensemble, fit_ea, or predict_many methods of a Pipeline instance:
- To fit to a single sample, use the X keyword argument, and optionally the y and sample_weight keyword arguments.
- To fit to a series of samples, use the args_list and sampler keyword arguments.
If X is given, it is assumed to be an ElmStore or xarray.Dataset.
If sampler is given with args_list, then each element of args_list is unpacked as arguments to the callable sampler. There is a special case of giving sampler as elm.readers.band_selection.select_from_file, which allows using the functions from elm.readers for reading common formats and selecting bands from files (the band_specs argument). Here is an example that uses select_from_file to load multi-band HDF4 arrays:
import glob
import os

from elm.readers import BandSpec
from elm.readers.metadata_selection import meta_is_day
# select_from_file, load_hdf4_meta, and ELM_EXAMPLE_DATA_PATH are assumed to be
# imported / defined as in the Fit Ensemble example later in these docs.
band_specs = list(map(lambda x: BandSpec(**x),
[{'search_key': 'long_name', 'search_value': "Band 1 ", 'name': 'band_1'},
{'search_key': 'long_name', 'search_value': "Band 2 ", 'name': 'band_2'},
{'search_key': 'long_name', 'search_value': "Band 3 ", 'name': 'band_3'},
{'search_key': 'long_name', 'search_value': "Band 4 ", 'name': 'band_4'},
{'search_key': 'long_name', 'search_value': "Band 5 ", 'name': 'band_5'},
{'search_key': 'long_name', 'search_value': "Band 6 ", 'name': 'band_6'},
{'search_key': 'long_name', 'search_value': "Band 7 ", 'name': 'band_7'},
{'search_key': 'long_name', 'search_value': "Band 9 ", 'name': 'band_9'},
{'search_key': 'long_name', 'search_value': "Band 10 ", 'name': 'band_10'},
{'search_key': 'long_name', 'search_value': "Band 11 ", 'name': 'band_11'}]))
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
if meta_is_day(load_hdf4_meta(f))]
data_source = {
'sampler': select_from_file,
'band_specs': band_specs,
'args_list': HDF4_FILES,
}
Alternatively, to train on a single HDF4 file, we could have done:
from elm.readers import load_array
from elm.sample_util.metadata_selection import example_meta_is_day
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
if example_meta_is_day(load_hdf4_meta(f))]
data_source = {'X': load_array(HDF4_FILES[0], band_specs=band_specs)}
Transformations¶
A Pipeline is created by giving a list of steps; the steps before the final step are known as transformers, and the final step is the estimator. See also the full docs on elm.pipeline.steps.
Here is an example Pipeline of transformations before K-Means:
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

from elm.pipeline import steps, Pipeline
pipeline_steps = [steps.Flatten(),
('scaler', steps.StandardScaler()),
('pca', steps.Transform(IncrementalPCA(n_components=4), partial_fit_batches=2)),
('kmeans', MiniBatchKMeans(n_clusters=4, compute_labels=True)),]
The example above calls:
- steps.Flatten first (see transformers-flatten), as a utility for flattening our multi-band raster HDF4 sample(s) into an ElmStore with a single xarray.DataArray, called flat, with each band as a column in flat.
- StandardScaler with default arguments from sklearn.preprocessing (all other transformers from sklearn.preprocessing and sklearn.feature_selection are also attributes of elm.pipeline.steps and could be used here)
- PCA with elm.pipeline.steps.Transform to wrap scikit-learn transformers, allowing multiple calls to partial_fit within a single fitting task of the final estimator. steps.Transform is initialized with:
  - A scikit-learn transformer as an argument
  - partial_fit_batches as a keyword, defaulting to 1. Note: using partial_fit_batches != 1 requires a transformer with a partial_fit method
- Finally MiniBatchKMeans
Multi-Model / Multi-Sample Fitting¶
There are two multi-model approaches to fitting that can be used with a Pipeline: fit_ensemble or fit_ea. The examples above of a data source for a Pipeline and the transformation steps within one Pipeline instance work similarly in fit_ensemble and fit_ea.
Other similarities between fit_ea and fit_ensemble include the following common keyword arguments:
- scoring: a callable with a signature like elm.model_selection.kmeans.kmeans_aic (see API docs) or a string like f_classif, an attribute name from sklearn.metrics
- scoring_kwargs: kwargs passed to the scoring callable if needed
- saved_ensemble_size: an integer indicating how many Pipeline estimators to retain in the final ensemble
Read more on controlling ensemble or evolutionary algorithm approaches to fitting in the Fit Ensemble and Fit EA sections below.
Multi-Model / Multi-Sample Prediction¶
After fit_ensemble or fit_ea has been called on a Pipeline instance, the instance will have the attribute ensemble, a list of (tag, pipeline) tuples which are the final Pipeline instances selected by either of the fitting functions (see also saved_ensemble_size under Controlling Ensemble Initialization). With a fitted Pipeline instance, predict_many can be called on the instance to predict from every ensemble member (Pipeline instance) on a single X sample, or from every ensemble member and every sample if sampler and args_list are given in place of X.
Read more on controlling predict_many.
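A hedged sketch of predict_many after fitting (pipe and data_source follow the Fit Ensemble example later on this page):
from elm.config import client_context

with client_context() as client:
    preds = pipe.predict_many(client=client, **data_source)   # every ensemble member x every sample
print(len(pipe.ensemble), len(preds))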
elm.pipeline.steps¶
The examples below assume you have created a random ElmStore as follows:
from elm.pipeline import steps
from elm.pipeline.tests.util import random_elm_store
X = random_elm_store()
Operations to reshape an ElmStore¶
Flatten - Flatten each 2-D DataArray in an ElmStore to create an ElmStore with a single DataArray called flat that is 2-D (each band raster is raveled from 2-D to a 1-D column in flat). Example:
steps.Flatten().fit_transform(X)
Agg - Aggregate over a dimension or axis. Example:
steps.Agg(axis=0, func='mean').fit_transform(X)
DropNaRows - Remove null / NaN rows from an ElmStore that has been through steps.Flatten():
steps.DropNaRows().fit_transform(*steps.Flatten().fit_transform(X))
InverseFlatten - Convert a flattened ElmStore back to 2-D rasters as separate DataArray values in an ElmStore. Example:
steps.InverseFlatten().fit_transform(*steps.Flatten().fit_transform(X))
Use an unsupervised feature extractor¶
Transform - steps.Transform allows one to use any sklearn.decomposition method in an elm Pipeline. Partial fit of the feature extractor can be accomplished by giving partial_fit_batches at initialization:
from sklearn.decomposition import IncrementalPCA
X, y, sample_weight = steps.Flatten().fit_transform(X)
pca = steps.Transform(IncrementalPCA(),
partial_fit_batches=2)
pca.fit_transform(X)
Run a user-given callable¶
There are two choices for running a user-given callable in a Pipeline. Using ModifySample is the most general, taking any shape of X, y, and sample_weight arguments, while FunctionTransformer requires that the ElmStore has been through steps.Flatten().
ModifySample - The following shows an example function with the required signature for use with ModifySample. It normalizes the values in each DataArray by their maximum when the normalize keyword is given. Note the function always returns a tuple of (X, y, sample_weight), even if y and sample_weight are not used by the function:
def modifier(X, y=None, sample_weight=None, **kwargs):
for band in X.data_vars:
arr = getattr(X, band)
if kwargs.get('normalize'):
arr.values /= arr.values.max()
return X, y, sample_weight
steps.ModifySample(modifier, normalize=True).fit_transform(X)
FunctionTransformer - Here is an example using the FunctionTransformer from sklearn:
import numpy as np
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
Xnew, y, sample_weight = steps.FunctionTransformer(func=np.log).fit_transform(Xnew)
Preprocessing - Scaling / Normalization¶
Each of the following classes from scikit-learn has been wrapped for usage as a Pipeline step. Each requires that the ElmStore has been through steps.Flatten().
The examples below continue with Xnew, a flattened ElmStore:
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
KernelCenterer - See also the KernelCenterer scikit-learn docs.
steps.KernelCenterer().fit_transform(Xnew)
MaxAbsScaler - See also the MaxAbsScaler scikit-learn docs.
steps.MaxAbsScaler().fit_transform(*steps.Flatten().fit_transform(X))
MinMaxScaler - See also the MinMaxScaler scikit-learn docs.
steps.MinMaxScaler().fit_transform(Xnew)
Normalizer - See also the Normalizer scikit-learn docs.
steps.Normalizer().fit_transform(Xnew)
RobustScaler - See also the RobustScaler scikit-learn docs.
steps.RobustScaler().fit_transform(Xnew)
PolynomialFeatures - See also the PolynomialFeatures scikit-learn docs.
step = steps.PolynomialFeatures(degree=3,
                                interaction_only=False)
step.fit_transform(Xnew)
StandardScaler - See also the StandardScaler scikit-learn docs.
steps.StandardScaler().fit_transform(Xnew)
Encoding Preprocessors from sklearn¶
Each method here requires that the ElmStore has been through steps.Flatten() as follows:
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
Binarizer - Binarize features. See also the Binarizer docs from sklearn.
steps.Binarizer().fit_transform(Xnew)
Imputer - Impute missing values. See also the Imputer docs from sklearn.
steps.Imputer().fit_transform(Xnew)
Feature selectors¶
The following list shows the feature selectors that may be used in a Pipeline. The methods, with the exception of VarianceThreshold, each require y to be not None.
Setup for the examples:
from sklearn.linear_model import LinearRegression

X, y = random_elm_store(return_y=True)
X = steps.Flatten().fit_transform(X)[0]
RFE - See also RFE in the sklearn docs. Example:
steps.RFE(estimator=LinearRegression()).fit_transform(X, y)
RFECV - See also RFECV in the sklearn docs. Example:
steps.RFECV(estimator=LinearRegression()).fit_transform(X, y)
SelectFdr - See also SelectFdr in the sklearn docs. Example:
steps.SelectFdr().fit_transform(X, y)
SelectFpr - See also SelectFpr in the sklearn docs. Example:
steps.SelectFpr().fit_transform(X, y)
SelectFromModel - See also SelectFromModel in the sklearn docs. Example:
steps.SelectFromModel(estimator=LinearRegression()).fit_transform(X, y)
SelectFwe - See also SelectFwe in the sklearn docs. Example:
steps.SelectFwe().fit_transform(X, y)
SelectKBest - See also SelectKBest in the sklearn docs. Example:
steps.SelectKBest(k=2).fit_transform(X, y)
SelectPercentile - See also SelectPercentile in the sklearn docs. Example:
steps.SelectPercentile(percentile=50).fit_transform(X, y)
VarianceThreshold - See also VarianceThreshold in the sklearn docs. Example:
steps.VarianceThreshold(threshold=6.92).fit_transform(X)
Normalizing time dimension of 3-D Cube¶
The following two functions take an ElmStore with a DataArray of any name that is a 3-D cube with a time dimension. The functions run descriptive stats along the time dimension and flatten the spatial (x, y) dims to space (essentially a ravel of the (x, y) points).
Setup - make a compatible ElmStore:
from elm.readers import ElmStore
import numpy as np
import xarray as xr
def make_3d():
arr = np.random.uniform(0, 1, 100000).reshape(100, 10, 100)
return ElmStore({'band_1': xr.DataArray(arr,
coords=[('time', np.arange(100)),
('x', np.arange(10)),
('y',np.arange(100))],
dims=('time', 'x', 'y'),
attrs={})}, attrs={}, add_canvas=False)
X = make_3d()
TSDescribe - Runs scipy.stats.describe and other stats along the time axis of a 3-D cube DataArray. Example:
s = steps.TSDescribe(band='band_1', axis=0)
Xnew, y, sample_weight = s.fit_transform(X)
Xnew.flat.band
The above code would show that the band dimension of Xnew consists of different summary statistics, mostly from scipy.stats.describe:
<xarray.DataArray 'band' (band: 8)>
array(['var', 'skew', 'kurt', 'min', 'max', 'median', 'std', 'np_skew'],
dtype='<U7')
Coordinates:
* band (band) <U7 'var' 'skew' 'kurt' 'min' 'max' 'median' 'std' 'np_skew'
TSProbs - TSProbs will bin, count, and return probabilities associated with the bin counts. An example:
fixed_bins = steps.TSProbs(band='band_1',
bin_size=0.5,
num_bins=152,
log_probs=True,
axis=0)
Xnew, y, sample_weight = fixed_bins.fit_transform(X)
The above would create the DataArray Xnew.flat with 152 columns consisting of the log-transformed bin probabilities (152 bins of 0.5 width).
And the following would use irregular (numpy.histogram) bins rather than fixed bins and return probabilities without a log transform first:
irregular_bins = steps.TSProbs(band='band_1',
num_bins=152,
log_probs=False,
axis=0)
Xnew, y, sample_weight = irregular_bins.fit_transform(X)
Multi-Model Fitting
Fit Ensemble¶
Ensemble fitting may be helpful in cases where the representative sample is large and/or model parameter or fitting uncertainty should be considered.
Ensemble fitting may:
- Use one or more samples,
- Use one or more models (Pipeline instances), and/or
- Use one or more generations of fitting, with model selection logic on each generation
It is helpful to first read the section Data Sources for a Pipeline showing how to use either a single X matrix or a series of samples from a sampler callable.
Define a Sampler¶
The example below uses a sampler function and args_list (a list of unpackable args to sampler) to fit to many samples. The full script can be found here. First the script does some imports and sets up a sampler function that uses band_specs (see also ElmStore) to select a subset of bands in HDF4 files.
import glob
import os
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
import numpy as np
from elm.config.dask_settings import client_context
from elm.model_selection.kmeans import kmeans_model_averaging, kmeans_aic
from elm.pipeline import steps, Pipeline
from elm.readers import *
from elm.sample_util.band_selection import select_from_file
from elm.sample_util.metadata_selection import example_meta_is_day
ELM_EXAMPLE_DATA_PATH = os.environ['ELM_EXAMPLE_DATA_PATH']
band_specs = list(map(lambda x: BandSpec(**x),
[{'search_key': 'long_name', 'search_value': "Band 1 ", 'name': 'band_1'},
{'search_key': 'long_name', 'search_value': "Band 2 ", 'name': 'band_2'},
{'search_key': 'long_name', 'search_value': "Band 3 ", 'name': 'band_3'},
{'search_key': 'long_name', 'search_value': "Band 4 ", 'name': 'band_4'},
{'search_key': 'long_name', 'search_value': "Band 5 ", 'name': 'band_5'},
{'search_key': 'long_name', 'search_value': "Band 6 ", 'name': 'band_6'},
{'search_key': 'long_name', 'search_value': "Band 7 ", 'name': 'band_7'}]))
# Just get daytime files
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
if example_meta_is_day(load_hdf4_meta(f))]
data_source = {
'sampler': select_from_file,
'band_specs': band_specs,
'args_list': HDF4_FILES,
}
Define Pipeline Steps¶
Next a Pipeline is configured that flattens the separate band rasters to a single 2-D DataArray (see also transform-flatten), uses standard scaling from scikit-learn, then transforms with IncrementalPCA with 2 partial_fit batches before K-Means. The Pipeline constructor also takes a scoring callable and optional scoring_kwargs.
pipeline_steps = [steps.Flatten(),
('scaler', steps.StandardScaler()),
('pca', steps.Transform(IncrementalPCA(n_components=4), partial_fit_batches=2)),
('kmeans', MiniBatchKMeans(n_clusters=4, compute_labels=True)),]
pipe = Pipeline(pipeline_steps, scoring=kmeans_aic, scoring_kwargs=dict(score_weights=[-1]))
See the signature for kmeans_aic here to write a similar scoring function; otherwise scoring defaults to calling the estimator's .score method, raising an exception if .score is not defined.
Configure Ensemble¶
Now we can call fit_ensemble after choosing some controls on the size of the ensemble, the number of generations, and the logic for selecting models after each generation.
Here’s an example:
ensemble_kwargs = {
'model_selection': kmeans_model_averaging,
'model_selection_kwargs': {
'drop_n': 2,
'evolve_n': 2,
},
'init_ensemble_size': 4,
'ngen': 3,
'partial_fit_batches': 2,
'saved_ensemble_size': 4,
'models_share_sample': True,
}
In the example above:
- ngen sets the number of generations to 3
- There are 4 initial ensemble members (init_ensemble_size)
- After each generation kmeans_model_averaging (see API docs) is called on the ensemble, and model_selection_kwargs are passed to it
- There are 2 partial_fit batches for MiniBatchKMeans on every Pipeline instance (partial_fit within the IncrementalPCA was configured in the initialization of steps.Transform above)
- models_share_sample is set to True, so in each generation every ensemble member is fit to the same sample; on the next generation, every model is fit to the next sample determined by sampler and args_list in this case. If models_share_sample were False, then in each generation every ensemble member would be copied and fit to every sample, repeating the process on each generation.
Fitting with Dask-Distributed¶
In the snippets above, we have a data_source dict with sampler, band_specs, and args_list key / values. We can pass this, with the ensemble_kwargs ensemble configuration, to fit_ensemble as well as predict_many (the data source for predict_many does not necessarily have to be the same one given to fit_ensemble or fit_ea).
Note: If you want dask-distributed as a client, first make sure you are running a dask-scheduler and dask-worker. Read more here on dask-distributed and follow the instructions in environment variables.
with client_context() as client:
ensemble_kwargs['client'] = client
pipe.fit_ensemble(**data_source, **ensemble_kwargs)
pred = pipe.predict_many(client=client, **data_source)
Fitting with dask parallelizes over the ensemble members (Pipeline instances) and over the calls to partial_fit; currently transformers in the Pipeline are not parallelized with dask.
Controlling Ensemble Initialization¶
To initialize the ensemble with Pipeline instances that do not all share the same parameters (as above), we could replace init_ensemble_size above with ensemble_init_func:
n_clusters_choices = tuple(range(4, 9))
def ensemble_init_func(pipe, **kwargs):
models = []
for c in n_clusters_choices:
new_pipe = pipe.new_with_params(kmeans__n_clusters=c)
models.append(new_pipe)
return models
ensemble_kwargs = {
'model_selection': kmeans_model_averaging,
'model_selection_kwargs': {
'drop_n': 2,
'evolve_n': 2,
},
'ensemble_init_func': ensemble_init_func,
'ngen': 3,
'partial_fit_batches': 2,
'saved_ensemble_size': 4,
'models_share_sample': True,
}
with client_context() as client:
ensemble_kwargs['client'] = client
pipe.fit_ensemble(**data_source, **ensemble_kwargs)
pred = pipe.predict_many(client=client, **data_source)
In the example above, Pipeline.new_with_params(kmeans__n_clusters=c) uses the scikit-learn syntax for parameter modification of named steps in a pipeline. In the initialization of the Pipeline above, the MiniBatchKMeans step was named kmeans, so kmeans__n_clusters=c sets the n_clusters parameter of the K-Means step, and the ensemble in this case consists of one Pipeline for each of the n_clusters choices in (4, 5, 6, 7, 8).
Fit EA¶
elm can use an evolutionary algorithm for hyperparameterization. This involves using the fit_ea method of Pipeline. It is helpful at this point to first read about Pipeline and how to configure a data source for the multi-model approaches in elm. That page summarizes how fit_ea and fit_ensemble may be fit to a single X matrix (when the keyword X is given) or a series of samples (when sampler and args_list are given).
The example below walks through configuring an evolutionary algorithm to select the best K-Means model, with preprocessing steps that include feature selection. First it sets up a sampler from HDF4 files (note the setup of a data source is the same as in fit_ensemble).
Example¶
import glob
import os
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
from elm.config.dask_settings import client_context
from elm.model_selection.evolve import ea_setup
from elm.model_selection.kmeans import kmeans_model_averaging, kmeans_aic
from elm.pipeline import Pipeline, steps
from elm.readers import *
from elm.sample_util.band_selection import select_from_file
from elm.sample_util.metadata_selection import example_meta_is_day
ELM_EXAMPLE_DATA_PATH = os.environ['ELM_EXAMPLE_DATA_PATH']
band_specs = list(map(lambda x: BandSpec(**x),
[{'search_key': 'long_name', 'search_value': "Band 1 ", 'name': 'band_1'},
{'search_key': 'long_name', 'search_value': "Band 2 ", 'name': 'band_2'},
{'search_key': 'long_name', 'search_value': "Band 3 ", 'name': 'band_3'},
{'search_key': 'long_name', 'search_value': "Band 4 ", 'name': 'band_4'},
{'search_key': 'long_name', 'search_value': "Band 5 ", 'name': 'band_5'},
{'search_key': 'long_name', 'search_value': "Band 6 ", 'name': 'band_6'},
{'search_key': 'long_name', 'search_value': "Band 7 ", 'name': 'band_7'}]))
# Just get daytime files
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
if example_meta_is_day(load_hdf4_meta(f))]
data_source = {
'sampler': select_from_file,
'band_specs': band_specs,
'args_list': HDF4_FILES,
}
Next the example sets up a Pipeline of transformations:
def make_example_y_data(X, y=None, sample_weight=None, **kwargs):
    fitted = MiniBatchKMeans(n_clusters=5).fit(X.flat.values)
    y = fitted.predict(X.flat.values)
    return (X, y, sample_weight)

pipeline_steps = [steps.Flatten(),
                  steps.ModifySample(make_example_y_data),
                  ('top_n', steps.SelectPercentile(percentile=80, score_func=f_classif)),
                  ('kmeans', MiniBatchKMeans(n_clusters=4))]
pipeline = Pipeline(pipeline_steps, scoring=kmeans_aic, scoring_kwargs=dict(score_weights=[-1]))
The example above uses elm.pipeline.steps.ModifySample to return a y data set corresponding to the X ElmStore, so that the example can demonstrate SelectPercentile for feature selection.
Next evo_params need to be created by passing a param_grid dict to elm.model_selection.evolve.ea_setup. The param_grid uses scikit-learn syntax for parameter replacement (i.e. a named step like “kmeans”, then a double underscore, then a parameter name for that step [“n_clusters”]), so this param_grid could potentially run models with n_clusters in range(3, 10) and percentile in range(20, 100, 5). The control dict sets parameters for the evolutionary algorithm (described below).
param_grid = {
    'kmeans__n_clusters': list(range(3, 10)),
    'top_n__percentile': list(range(20, 100, 5)),
    'control': {
        'select_method': 'selNSGA2',
        'crossover_method': 'cxTwoPoint',
        'mutate_method': 'mutUniformInt',
        'init_pop': 'random',
        'indpb': 0.5,
        'mutpb': 0.9,
        'cxpb': 0.3,
        'eta': 20,
        'ngen': 2,
        'mu': 4,
        'k': 4,
        'early_stop': {'abs_change': [10], 'agg': 'all'},
        # alternatively early_stop: {percent_change: [10], agg: all}
        # alternatively early_stop: {threshold: [10], agg: any}
    }
}
evo_params = ea_setup(param_grid=param_grid,
param_grid_name='param_grid_example',
score_weights=[-1]) # minimization
Running with dask parallelizes over the individual solutions (Pipeline instances) and their calls to partial_fit.
Note: If you want to use dask-distributed as a client, first make sure you are running a dask-scheduler and dask-worker. Read more here on dask-distributed and follow the instructions in environment variables.
with client_context() as client:
    fitted = pipeline.fit_ea(evo_params=evo_params,
                             client=client,
                             **data_source)
    preds = pipeline.predict_many(client=client, **data_source)
Reference param_grid - control¶
In the example above the param_grid has a control dictionary specifying parameters of the evolutionary algorithm. The control dict names the functions to be used for crossover, mutation, and selection, and the other arguments are passed to those methods as needed. The following section describes each key/value of a control dictionary.
Note: While it is possible to change the select_method, crossover_method and mutate_method below from the example shown, it is important to use methods that are consistent with how fit_ea expresses parameter choices. For each parameter in the param_grid, such as kmeans__n_clusters=list(range(3, 10)), fit_ea optimizes with indices into the kmeans__n_clusters list, i.e. choosing among list(range(7)), not optimizing an integer parameter between 3 and 10. This allows fit_ea to avoid custom treatment of string, float, or integer data types in the parameters’ lists of choices. If changing the mutate_method, keep in mind that it needs to take individuals that are sequences of integers as arguments and return the same.
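As a conceptual sketch (not elm internals), the mapping from an individual's integer genes to concrete parameters looks like this:

# fit_ea evolves indices into each parameter's list of choices,
# not the raw parameter values themselves
choices = [('kmeans__n_clusters', list(range(3, 10))),       # 7 choices -> indices 0..6
           ('top_n__percentile', list(range(20, 100, 5)))]   # 16 choices -> indices 0..15
individual = [4, 2]   # a hypothetical individual: one index per parameter
params = {name: values[idx]
          for (name, values), idx in zip(choices, individual)}
# params == {'kmeans__n_clusters': 7, 'top_n__percentile': 30}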
- select_method: Selection method on each generation of the evolutionary algorithm. The selection method is typically selNSGA2 but can be any deap.tools selection method.
- crossover_method: Crossover method between two individuals, e.g. cxTwoPoint, or any crossover method from deap.tools.
- mutate_method: Mutation method, typically mutUniformInt, or another mutation method from deap.tools.
- init_pop: Placeholder for initialization features - must always be random (random initialization of solutions).
- indpb: Probability each attribute (feature) is mutated when an individual is mutated, e.g. 0.5 (passed to mutation methods in deap.tools).
- mutpb: When two individuals crossover, this is the probability they will mutate immediately after crossover, e.g. 0.9.
- cxpb: Probability of crossover, e.g. 0.3.
- eta: Tuning parameter in NSGA-2, passed to the mutate and mate methods. With a higher eta, crowding is penalized and offspring are more different from their parents.
- ngen: Number of generations in the genetic algorithm.
- mu: Initial size of the population of solutions (individuals).
- k: Select the top k on each generation.
- early_stop: Controls stopping of the algorithm before ngen generations are completed. Examples are below (note agg refers to aggregation as all or any in the case of a multi-objective problem).
- Stop on absolute change in objective: {'abs_change': [10], 'agg': 'all'}
- Stop on percent change in objective: early_stop: {percent_change: [10], agg: all}
- Stop on reaching objective threshold: early_stop: {threshold: [10], agg: any}
More Reading¶
fit_ea
relies on deap
for Pareto sorting and the genetic algorithm components described above. Read more about deap
:
Multi-Model Prediction
Example with predict_many¶
elm’s predict_many predicts for each estimator in a trained ensemble for one or more samples. predict_many takes some of the same data source keyword arguments that fit_ea and fit_ensemble use. See also Data Sources for a Pipeline - it discusses using a single sample by giving the keyword argument X, or giving a sampler and args_list (a list of unpackable args to the sampler callable). The same logic applies for predict_many.
predict_many has a to_cube keyword argument that is useful when predicting for spatial data. to_cube=True (the default) converts the 1-D numpy array of predictions from the estimator of a Pipeline instance to a 2-D raster with the coordinates the input data had before being flattened for training. This makes it easy to make xarray pcolormesh plots of predictions in the spatial coordinates of the satellite and weather data the models were trained on.
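For example, a minimal sketch (assuming a fitted pipe and a data_source dict as in the other snippets on this page):

# default: each prediction is returned as a 2-D raster ElmStore
preds_raster = pipe.predict_many(**data_source)

# keep the flat 1-D predictions instead
preds_flat = pipe.predict_many(to_cube=False, **data_source)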
Example - SGDClassifier¶
The following example shows fitting a stochastic gradient descent classifier in ensemble with partial_fit
, varying the alpha
and penalty
parameters to sklearn.linear_model.SGDClassifier
and finally predicting from the best models of the ensemble over several input samples.
Import from elm and sklearn¶
This is a common set of import statements when working with elm:
from collections import OrderedDict
from elm.pipeline import Pipeline, steps
from elm.readers import *
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import xarray as xr
Define model selection¶
We can define a callable with a signature like model_selection below to control which models are passed from generation to generation in an ensemble. This function just uses best_idxes (indices of models Pareto-sorted by fitness from accuracy_score):
def model_selection(models, best_idxes=None, **kwargs):
    # keep the top_n models, ranked by the Pareto-sorted fitness indices
    top_n = kwargs['top_n']
    return [models[idx] for idx in best_idxes[:top_n]]
See also model_selection
in Controlling Ensemble Initialization.
Define initial ensemble¶
To vary the parameters of the initial ensemble of Pipeline instances, provide an ensemble_init_func
. pipe.new_with_params
is used here to create a variety of Pipeline objects that have different SGDClassifier
alpha
and penalty
parameters.
def ensemble_init_func(pipe, **kwargs):
    models = []
    for penalty in ('l1', 'l2'):
        for alpha in (0.0001, 0.001, 0.01):
            new_pipe = pipe.new_with_params(sgd__penalty=penalty, sgd__alpha=alpha)
            models.append(new_pipe)
    return models
See also ensemble_init_func
in Controlling Ensemble Initialization.
Control partial_fit and ensemble¶
The following dict contains keyword arguments to pass to fit_ensemble, including setting the number of generations ngen, using partial_fit twice per fitting of each model, and finally retaining the 2 best models (saved_ensemble_size). Note also that partial_fit requires the keyword argument classes, a sequence of all known classes, so this is passed via method_kwargs:
ensemble_kwargs = {
    'model_selection': model_selection,
    'model_selection_kwargs': {
        'top_n': 2,
    },
    'ensemble_init_func': ensemble_init_func,
    'ngen': 3,
    'partial_fit_batches': 2,
    'saved_ensemble_size': 2,
    'method_kwargs': {'classes': np.arange(5)},
    'models_share_sample': True,
}
See also ensemble_kwargs
in Controlling Ensemble Initialization.
Define a sampler¶
The following lines of code use the synthetic data helper make_blobs from sklearn.datasets to create an ElmStore with 5 bands (each band is a DataArray):
rand_X_y = lambda n_samples: make_blobs(centers=[[1, 2, 3, 4, 5], [2, 3, 6, 8, 9], [3, 4, 5, 10, 12]], n_samples=n_samples)

def sampler_train(width, height, **kwargs):
    X, y = rand_X_y(width * height)
    bands = ['band_{}'.format(idx + 1) for idx in range(X.shape[1])]
    es_data = OrderedDict()
    for idx, band in enumerate(bands):
        arr = xr.DataArray(X[:, idx].reshape(height, width),
                           coords=[('y', np.arange(height)),
                                   ('x', np.arange(width))],
                           dims=('y', 'x'))
        es_data[band] = arr
    # No geo_transform in attrs of arr, so add_canvas = False
    es = ElmStore(es_data, add_canvas=False)
    sample_weight = None
    return es, y, sample_weight
Testing out sampler_train
:
In [42]: X, y, _ = sampler_train(10, 12)
In [43]: X, y
Out[43]:
(ElmStore:
<elm.ElmStore>
Dimensions: (x: 10, y: 12)
Coordinates:
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11
* x (x) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
band_1 (y, x) float64 0.5343 -1.21 1.241 2.191 3.364 2.115 3.579 3.086 ...
band_2 (y, x) float64 3.657 3.575 1.164 4.786 4.354 3.74 1.924 3.674 ...
band_3 (y, x) float64 4.909 2.258 2.761 4.313 5.379 4.145 6.515 5.137 ...
band_4 (y, x) float64 9.872 5.329 4.786 10.41 10.96 6.878 7.356 10.11 ...
band_5 (y, x) float64 7.343 5.88 3.924 11.82 11.53 10.16 10.78 11.74 ...
Attributes:
_dummy_canvas: True
band_order: ['band_1', 'band_2', 'band_3', 'band_4', 'band_5'],
array([1, 0, 0, 2, 2, 1, 1, 2, 2, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 2, 2, 0,
0, 0, 2, 1, 0, 2, 0, 2, 2, 1, 2, 1, 2, 0, 2, 2, 0, 0, 2, 1, 1, 2, 2,
0, 1, 2, 0, 1, 0, 1, 2, 0, 0, 0, 1, 1, 1, 2, 1, 1, 2, 2, 2, 0, 1, 1,
2, 0, 2, 2, 1, 0, 1, 2, 1, 0, 0, 1, 1, 1, 2, 1, 0, 2, 1, 0, 1, 2, 0,
0, 2, 1, 1, 0, 1, 2, 2, 1, 0, 2, 0, 1, 0, 1, 1, 2, 0, 0, 2, 1, 1, 1,
2, 2, 1, 0, 2]))
Pipeline with scoring¶
The example below sets up accuracy_score for scoring a Pipeline that will flatten the sample and run SGDClassifier. The scoring_kwargs include greater_is_better (passed to sklearn.model_selection.make_scorer) and score_weights, determining whether to sort models from minimum to maximum fitness (-1) or maximum to minimum (1). Here we are maximizing the accuracy_score:
pipe = Pipeline([steps.Flatten(),
                 ('sgd', SGDClassifier())],
                scoring=accuracy_score,
                scoring_kwargs=dict(greater_is_better=True, score_weights=[1]))
Read more documentation here on all the options available in elm.pipeline.steps
.
Call fit_ensemble¶
Calling fit_ensemble with an args_list of length 3, we are fitting all models in the ensemble to the same sample in one generation, then proceeding to fit all models against the next sample in the next generation. In this case we have 3 generations (ngen above), 3 samples (len(args_list) below), and models_share_sample=True. Each generation will use a different sample, and all models in a generation will be fitted to that sample.
data_source = dict(sampler=sampler_train, args_list=[(100, 120)] * 3)
fitted = pipe.fit_ensemble(**data_source, **ensemble_kwargs)
Call predict_many¶
We currently have 2 models in the ensemble (see saved_ensemble_size above, which sets the top N models to keep) and an args_list that will generate 3 samples: predict_many will predict for 6 sample / model combinations.
preds = pipe.predict_many(**data_source)
Checking the number of predictions returned:
In [7]: len(preds)
Out[7]: 6
Each item in preds
is an ElmStore
with a DataArray
called predict
. In this case that DataArray
is a 2-D raster because we used the default keyword argument to_raster=True
when predict_many was called. The next snippet shows using the plot attribute of the predict
DataArray
:
See also documentation on plotting with xarray
p = preds[0]
p.predict.plot.pcolormesh()
Predicting from an Ensemble Subset¶
By default predict_many will look for an attribute on the Pipeline instance called .ensemble, which is expected to be a list of (tag, pipeline) tuples, and predict from each trained Pipeline instance in .ensemble. Alternatively you can pass a list of (tag, pipeline) tuples as the ensemble keyword argument. The example below predicts only from the best model in the ensemble (the final ensemble is sorted by model score if scoring was given to Pipeline initialization). There are 3 predictions because there are 3 samples.
In [16]: subset = pipe.ensemble[:1]
In [17]: preds = pipe.predict_many(ensemble=subset, **data_source)
In [18]: len(preds)
Out[18]: 3
Predictions Too Large For Memory¶
In the examples above, predict_many has returned a list of ElmStore objects. If the number of samples and/or models is large, then keeping all predictions in memory in a list is infeasible. In these cases, pass a serialize argument (callable) to predict_many to serialize prediction ElmStore outputs as they are generated. serialize should have a signature exactly like the example below:
import os
from sklearn.externals import joblib

def serialize(y, X, tag, elm_predict_path):
    dirr = os.path.join(elm_predict_path, tag)
    if not os.path.exists(dirr):
        os.mkdir(dirr)  # assuming ELM_PREDICT_PATH in environment
    base = "_".join('{:.02f}'.format(_) for _ in sorted(X.canvas.bounds))
    joblib.dump(y, os.path.join(dirr, base + '.xr'))
    return X.canvas

preds = pipe.predict_many(ensemble=pipe.ensemble[:1], serialize=serialize, **data_source)
In predicting over 3 samples and one model, we have created 3 joblib dump prediction files and returned three Canvas objects:
In [27]: preds
Out[27]:
(Canvas(geo_transform=(-180, 0.1, 0, 90, 0, -0.1), buf_xsize=10, buf_ysize=10, dims=('y', 'x'), ravel_order='C', zbounds=None, tbounds=None, zsize=None, tsize=None, bounds=BoundingBox(left=-180.0, bottom=90.0, right=-179.1, top=89.1)),
Canvas(geo_transform=(-180, 0.1, 0, 90, 0, -0.1), buf_xsize=10, buf_ysize=10, dims=('y', 'x'), ravel_order='C', zbounds=None, tbounds=None, zsize=None, tsize=None, bounds=BoundingBox(left=-180.0, bottom=90.0, right=-179.1, top=89.1)),
Canvas(geo_transform=(-180, 0.1, 0, 90, 0, -0.1), buf_xsize=10, buf_ysize=10, dims=('y', 'x'), ravel_order='C', zbounds=None, tbounds=None, zsize=None, tsize=None, bounds=BoundingBox(left=-180.0, bottom=90.0, right=-179.1, top=89.1)))
Here are some notes on the arguments passed to serialize if given (a simpler sketch follows the list):
- y is an ElmStore, either 1-D or 2-D (see the to_raster keyword to predict_many)
- X is the X ElmStore that was used for prediction (the Pipeline will preserve attrs in X useful for serializing y, as in the example above which used the .canvas attribute of X)
- tag is a unique tag of the sample and Pipeline instance
- elm_predict_path is the root dir for serialization output - ELM_PREDICT_PATH from environment variables.
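For example, here is a minimal alternative sketch of a serialize callable (serialize_simple is hypothetical; it ignores X and keys each output file by tag only):

import os
from sklearn.externals import joblib

def serialize_simple(y, X, tag, elm_predict_path):
    # X is available but not needed here; y is saved keyed by tag only
    out_path = os.path.join(elm_predict_path, '{}.joblib'.format(tag))
    joblib.dump(y, out_path)
    return out_path

preds = pipe.predict_many(ensemble=pipe.ensemble[:1], serialize=serialize_simple, **data_source)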
Parallel Prediction¶
To run predict_many (or fit_ensemble or fit_ea) in parallel using a dask-distributed client or dask ThreadPool client, use elm.config.client_context as shown here (continuing with the namespace defined by the snippets above). First make sure you are running a dask-scheduler and dask-worker. Read more here on dask-distributed.
with client_context(dask_executor='DISTRIBUTED', dask_scheduler='10.0.0.10:8786') as client:
    ensemble_kwargs['client'] = client
    fitted = pipe.fit_ensemble(**data_source, **ensemble_kwargs)
    preds = pipe.predict_many(client=client, **data_source)
In the example above, client_context could have been called with no arguments if the DASK_EXECUTOR and DASK_SCHEDULER environment variables were set.
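For example, a minimal sketch assuming DASK_EXECUTOR and DASK_SCHEDULER are already set in the environment (same pattern as the snippet above):

with client_context() as client:   # executor / scheduler read from environment variables
    ensemble_kwargs['client'] = client
    fitted = pipe.fit_ensemble(**data_source, **ensemble_kwargs)
    preds = pipe.predict_many(client=client, **data_source)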
With parallel predict_many
, each ensemble member / sample combination is a separate task - there is no parallelism within transformations of the Pipeline
.
Train / Predict From Yaml Config
elm yaml Specs¶
Workflows involving ensemble and evolutionary methods and predict_many can also be specified in a yaml config file for running with the elm-main console entry point. The yaml config can refer to functions from elm or user-given packages or modules. Read more about the yaml configuration file format here.
The repository elm-examples has a number of example yaml
configuration files for GeoTiff and HDF4
files as input to K-Means or stochastic gradient descent classifiers.
This page walks through each part of a valid yaml
config.
ensembles¶
The ensembles
section creates named dicts of keyword arguments to fit_ensemble. The example below creates example_ensemble
, an identifier we can use elsewhere in the config. If passing the keyword ensemble_init_func
in an ensemble here, then it should be given in “package.subpackage.module:callable” notation like a setup.py console entry point, e.g. "my_kmeans_module:make_ensemble"
.
ensembles: {
example_ensemble: {
init_ensemble_size: 1,
saved_ensemble_size: 1,
ngen: 3,
partial_fit_batches: 2,
},
}
data_sources¶
The dicts in data_sources create a named sampler with their keyword arguments.
In the config, args_list can be a callable. In this case it is iter_files_recursively, a function which takes top_dir and file_pattern as arguments. The filenames returned by iter_files_recursively will be filtered by example_meta_is_day, an example function for detecting whether a satellite data file is night or day based on its metadata. If args_list is callable, it should take a variable number of keyword arguments (**kwargs), as in the sketch below.
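As a rough illustration of that contract, here is a hypothetical args_list callable (the name, the fnmatch-style file_pattern, and the implementation are illustrative only; the config below uses elm.readers.local_file_iterators:iter_files_recursively):

import fnmatch
import os

def my_iter_files(top_dir=None, file_pattern='*.hdf', **kwargs):
    # must accept a variable number of keyword arguments and
    # return an iterable of arguments for the sampler (file names here)
    matches = []
    for root, _, files in os.walk(top_dir):
        for fname in fnmatch.filter(files, file_pattern):
            matches.append(os.path.join(root, fname))
    return matches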
This example creates ds_example, which selects from files to get bands 1 through 6, iterating recursively over .hdf files in ELM_EXAMPLE_DATA_PATH from the environment (env:SOMETHING means take SOMETHING from environment variables).
band_specs
in the data source are passed to elm.readers.BandSpec
(See also ElmStore and LANDSAT Example ) and determine which bands (subdatasets in this HDF4 case) to include in a sample.
data_sources: {
ds_example: {
sampler: "elm.sample_util.band_selection:select_from_file",
band_specs: [{search_key: long_name, search_value: "Band 1 ", name: band_1},
{search_key: long_name, search_value: "Band 2 ", name: band_2},
{search_key: long_name, search_value: "Band 3 ", name: band_3},
{search_key: long_name, search_value: "Band 4 ", name: band_4},
{search_key: long_name, search_value: "Band 5 ", name: band_5},
{search_key: long_name, search_value: "Band 6 ", name: band_6},],
args_list: "elm.readers.local_file_iterators:iter_files_recursively",
top_dir: "env:ELM_EXAMPLE_DATA_PATH",
metadata_filter: "elm.sample_util.metadata_selection:example_meta_is_day",
file_pattern: "\\.hdf",
},
}
See also Creating an ElmStore from File
model_scoring¶
Each dict in model_scoring has a scoring callable and the other keys/values are passed as scoring_kwargs. These in turn become the scoring and scoring_kwargs used to initialize a Pipeline instance. This example creates a scorer called kmeans_aic:
model_scoring: {
kmeans_aic: {
scoring: "elm.model_selection.kmeans:kmeans_aic",
score_weights: [-1],
}
}
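In an interactive session, this corresponds roughly to passing scoring and scoring_kwargs when initializing a Pipeline, e.g. with pipeline_steps as defined in the Fit EA example above:

from elm.model_selection.kmeans import kmeans_aic
from elm.pipeline import Pipeline

# roughly equivalent to the kmeans_aic model_scoring entry above
pipe = Pipeline(pipeline_steps, scoring=kmeans_aic,
                scoring_kwargs={'score_weights': [-1]})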
transform¶
This section allows using a transform model, such as IncrementalPCA from sklearn.decomposition. model_init_kwargs can include any keyword argument to the model_init_class, as well as partial_fit_batches (the number of partial_fit operations on each Pipeline fit or partial_fit).
transform: {
pca: {
model_init_class: "sklearn.decomposition:IncrementalPCA",
model_init_kwargs: {"n_components": 2, partial_fit_batches: 2},
}
}
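Interactively, the rough equivalent (assuming steps.Transform wraps the estimator and accepts partial_fit_batches, as in the ensemble example earlier on this page) would be a Pipeline step like:

from sklearn.decomposition import IncrementalPCA
from elm.pipeline import steps

# assumption: steps.Transform takes the estimator instance plus partial_fit_batches
pca = ('pca', steps.Transform(IncrementalPCA(n_components=2),
                              partial_fit_batches=2))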
sklearn_preprocessing¶
This section configures scikit-learn preprocessing classes (sklearn.preprocessing), such as PolynomialFeatures, for use elsewhere in the config. Each key is an identifier and each dictionary contains a method (imported from sklearn.preprocessing) and keyword arguments to that method.
sklearn_preprocessing: {
min_max: {
method: MinMaxScaler,
feature_range: [0, 1],
},
poly2_interact: {
method: PolynomialFeatures,
degree: 2,
interaction_only: True,
include_bias: True,
},
}
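Interactively, these entries map onto wrappers in elm.pipeline.steps of the same names; a sketch (steps.MinMaxScaler is assumed here, alongside the documented steps.PolynomialFeatures):

from elm.pipeline import steps

# rough interactive equivalents of min_max and poly2_interact above
min_max = steps.MinMaxScaler(feature_range=(0, 1))
poly2_interact = steps.PolynomialFeatures(degree=2, interaction_only=True,
                                          include_bias=True)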
train¶
The train dict configures the final estimator in a Pipeline, in this case MiniBatchKMeans. This example shows how to run that estimator with the example_ensemble keyword arguments and the kmeans_aic model scoring defined above, passing drop_n and evolve_n to the model_selection callable.
train: {
train_example: {
model_init_class: "sklearn.cluster:MiniBatchKMeans",
model_init_kwargs: {
compute_labels: True
},
ensemble: example_ensemble,
model_scoring: kmeans_aic,
model_selection: "elm.model_selection.kmeans:kmeans_model_averaging",
model_selection_kwargs: {
drop_n: 4,
evolve_n: 4,
}
}
}
feature_selection¶
Each key in this section is an identifier and each dict is a feature selector configuration, naming a method to be imported from sklearn.feature_selection and keyword arguments to that method.
feature_selection: {
top_half: {
method: SelectPercentile,
percentile: 50,
score_func: f_classif
}
}
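The interactive counterpart is the steps.SelectPercentile wrapper used in the Fit EA example above, e.g.:

from sklearn.feature_selection import f_classif
from elm.pipeline import steps

# interactive equivalent of the top_half entry above
top_half = ('top_half', steps.SelectPercentile(percentile=50, score_func=f_classif))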
run¶
The run
section names fitting and prediction jobs to be done by using identifiers created in the config’s dictionaries reviewed above.
About the run section:
- It is a list of actions
- Each action in the list is a dict
- Each action should have the key pipeline that is a list of dictionaries specifying steps (analogous to the interactive session Pipeline)
- Each action should have a data_source key pointing to one of the data_sources named above
- Each action can have predict and/or train key/value with the value being one of the named train dicts above
run:
- {pipeline: [{select_canvas: band_1},
{flatten: True},
{drop_na_rows: True},
{sklearn_preprocessing: poly2_interact},
{sklearn_preprocessing: min_max},
{transform: pca}],
data_source: ds_example,
predict: train_example,
train: train_example}
The example above showed a run
configuration with a pipeline
of transforms inclusive of flattening rasters, dropping null rows, adding polynomial interaction terms, min-max scaling, and PCA.
Valid steps for run - pipeline¶
This section shows all of the valid steps that can appear in a config’s run - pipeline lists (items that could have appeared in the pipeline list in the preceding example).
flatten
Flattens each 2-D DataArray raster to a column within a single DataArray called flat in an ElmStore.
{flatten: True}
See also transform-flatten.
See also: elm.pipeline.steps
drop_na_rows
Drops null rows from an ElmStore or xarray.Dataset with a DataArray called flat (often this step follows {flatten: True} in a pipeline).
{drop_na_rows: True}
See also transform-dropnarows.
modify_sample
Provides a callable and optionally keyword arguments to modify X and optionally y and sample_weight. See an example of interactive use of elm.pipeline.steps.ModifySample here - TODO LINK - and the function signature for a modify_sample callable here - TODO LINK. This example shows how to run normalizer_func, imported from a package and subpackage, passing keyword_1 and keyword_2.
{modify_sample: "mypackage.mysubpkg.mymodule:normalizer_func", keyword_1: 4, keyword_2: 99}
See also ModifySample
usage in a K-Means LANDSAT example .
transpose
Transpose the dimensions of the ElmStore
, like this example for converting from ("y", "x")
dims to ("x", "y")
dims.
{transpose: ['x', 'y']}
sklearn_preprocessing
If a config has a dict called sklearn_preprocessing
as in the example above, then named preprocessors in that dict can be used in the run
- pipeline
lists as follows:
{sklearn_preprocessing: poly2_interact}
where poly2_interact
is a key in sklearn_preprocessing
See also: elm.pipeline.steps.PolynomialFeatures
in elm.pipeline.steps
feature_selection
If a config has a dict called feature_selection
as in the example above, then named feature selectors there can be used in the run
- pipeline
section like this:
{feature_selection: top_half}
where top_half
is a named feature selector in feature_selection
.
transform
Note the config’s transform
section configures transform models like PCA but they are not used unless the config’s run
- pipeline
lists have a transform
action (dict) in them. Here is an example:
{transform: pca}
where pca
is a key in the config’s transform
dict.
elm-main Entry Point¶
elm-main runs the training and prediction steps configured in a yaml
file, as described in the yaml config examples. If you have not used yaml
before, read about yaml here first.
The simplest use of elm-main is to run one yaml config:
elm-main --config elm-examples/configs/kmeans_hdf4.yaml
elm-main uses the environment variables described here, many of which may be overridden by optional arguments to elm-main (see below).
The next section goes over all the command line options for elm-main.
Config(s) To Run¶
elm-main can run a single yaml
config or a directory of yaml
config files. To run with a single yaml
file, use the --config
argument as above, or to run with a directory of config yaml
files, use --config-dir.
Controlling Train vs. Predict¶
The following arguments control which parts of the config are being run:
- --train-only: Run only the training actions specified in the run section of the config
- --predict-only: Run only the predict actions specified in the config
Overriding Arguments to fit_ensemble¶
The following arguments, if given, will override similarly named values in the yaml config that are associated with the train section (configuration of a final estimator in a Pipeline):
- --partial-fit-batches: Number of partial_fit batches for the final estimator (this does not control partial_fit batches within a transform step in run - pipeline steps)
- --init-ensemble-size: Initial ensemble size (ignored if ensemble_init_func is given in config(s))
- --saved-ensemble-size: Final number of trained models to serialize in each train action of the run section
Use Dask Client¶
To use a dask-distributed
or dask ThreadPool
client, use the environment variables described here - or override them with command line arguments to elm-main:
- --dask-executor: One of DISTRIBUTED, SERIAL, or THREAD_POOL
- --dask-scheduler: Dask-distributed scheduler URL, e.g. 10.0.0.10:8786
Directories for Serialization¶
The following arguments control where trained models and predictions are saved:
- --elm-train-path: Trained Pipeline instances are saved here - see also ELM_TRAIN_PATH in environment variables.
- --elm-predict-path: Predictions are saved here - see also ELM_PREDICT_PATH in environment variables.
Help for elm-main¶
Here is the full help for elm-main:
$ elm-main --help
usage: elm-main [-h] [--config CONFIG | --config-dir CONFIG_DIR]
[--train-only | --predict-only]
[--partial-fit-batches PARTIAL_FIT_BATCHES]
[--init-ensemble-size INIT_ENSEMBLE_SIZE]
[--saved-ensemble-size SAVED_ENSEMBLE_SIZE] [--ngen NGEN]
[--dask-threads DASK_THREADS]
[--max-param-retries MAX_PARAM_RETRIES]
[--dask-executor {DISTRIBUTED,SERIAL,THREAD_POOL}]
[--dask-scheduler DASK_SCHEDULER]
[--elm-example-data-path ELM_EXAMPLE_DATA_PATH]
[--elm-train-path ELM_TRAIN_PATH]
[--elm-predict-path ELM_PREDICT_PATH]
[--elm-logging-level {INFO,DEBUG}]
Pipeline classifier / predictor using ensemble and partial_fit methods
optional arguments:
-h, --help show this help message and exit
--train-only Run only the training, not prediction, actions
specified by config
--predict-only Run only the prediction, not training, actions
specified by config
--echo-config Output running config as it is parsed
Inputs:
Input config file or directory
--config CONFIG Path to yaml config
--config-dir CONFIG_DIR
Path to a directory of yaml configs
Run:
Run options
Control:
Keyword arguments to elm.pipeline.ensemble
--partial-fit-batches PARTIAL_FIT_BATCHES
Partial fit batches (for estimator specified in
config's "train"
--init-ensemble-size INIT_ENSEMBLE_SIZE
Initial ensemble size (ignored if using
"ensemble_init_func"
--saved-ensemble-size SAVED_ENSEMBLE_SIZE
How many of the "best" models to serialize
--ngen NGEN Number of ensemble generations, defaulting to ngen
from ensemble_kwargs in config
Environment:
Compute settings (see also help on environment variables)
--dask-threads DASK_THREADS
See also env var DASK_THREADS
--dask-processes DASK_PROCESSES
See also env var DASK_PROCESSES
--max-param-retries MAX_PARAM_RETRIES
See also env var MAX_PARAM_RETRIES
--dask-executor {DISTRIBUTED,SERIAL,THREAD_POOL}
See also DASK_EXECUTOR
--dask-scheduler DASK_SCHEDULER
See also DASK_SCHEDULER
--elm-example-data-path ELM_EXAMPLE_DATA_PATH
See also ELM_EXAMPLE_DATA_PATH
--elm-train-path ELM_TRAIN_PATH
See also ELM_TRAIN_PATH
--elm-predict-path ELM_PREDICT_PATH
See also ELM_PREDICT_PATH
--elm-logging-level {INFO,DEBUG}
See also ELM_LOGGING_LEVEL
--elm-configs-path ELM_CONFIGS_PATH
See also ELM_CONFIGS_PATH
--elm-large-test ELM_LARGE_TEST
See also ELM_LARGE_TEST
Pipeline classifier / predictor using ensemble and partial_fit methods
optional arguments:
-h, --help show this help message and exit
--config CONFIG Path to yaml config
--config-dir CONFIG_DIR
Path to a directory of yaml configs
--dask-threads DASK_THREADS
See also env var DASK_THREADS
--dask-processes DASK_PROCESSES
See also env var DASK_PROCESSES
--max-param-retries MAX_PARAM_RETRIES
See also env var MAX_PARAM_RETRIES
--dask-executor {DISTRIBUTED,SERIAL,THREAD_POOL}
See also DASK_EXECUTOR
--dask-scheduler DASK_SCHEDULER
See also DASK_SCHEDULER
--ladsweb-local-cache LADSWEB_LOCAL_CACHE
See also LADSWEB_LOCAL_CACHE
--hashed-args-cache HASHED_ARGS_CACHE
See also HASHED_ARGS_CACHE
--elm-example-data-path ELM_EXAMPLE_DATA_PATH
See also ELM_EXAMPLE_DATA_PATH
--elm-train-path ELM_TRAIN_PATH
See also ELM_TRAIN_PATH
--elm-transform-path ELM_TRANSFORM_PATH
See also ELM_TRANSFORM_PATH
--elm-predict-path ELM_PREDICT_PATH
See also ELM_PREDICT_PATH
--elm-logging-level {INFO,DEBUG}
See also ELM_LOGGING_LEVEL
--elm-configs-path ELM_CONFIGS_PATH
See also ELM_CONFIGS_PATH
--echo-config Output running config as it is parsed
Help & Reference
- API
- Environment Variables
- py.test Unit Tests
- contributing
- Release Procedure
API¶
The API help information in this page is autogenerated from comments in the source.
The packages include:
- elm.readers, a collection of tools for reading HDF4, HDF5, NetCDF and GeoTiff files and directories of files
- elm.pipeline, the public interface for ensemble learning models (elm) and its Pipeline class for preprocessing transform steps before fitting in ensemble or in an evolutionary algorithm
- elm.model_selection, tools for scoring models, Pareto sorting models, and selecting the best ensemble members
- elm.sample_util, modules of helper functions to define classes / functions for elm.pipeline.steps (the possible steps in a Pipeline)
elm.readers¶
elm.pipeline¶
elm.model_selection¶
Environment Variables¶
The following environment variables control elm-main and are also inputs to other elm functions like elm.config.client_context (a dask client context):
- DASK_EXECUTOR: Dask executor to use. Choices [DISTRIBUTED, SERIAL, THREAD_POOL] (default: SERIAL)
- DASK_SCHEDULER: Dask scheduler URL, such as 10.0.0.10:8786, if using DASK_EXECUTOR=DISTRIBUTED
- DASK_THREADS: Number of threads if using DASK_EXECUTOR=THREAD_POOL
- ELM_EXAMPLE_DATA_PATH: Path to a local clone of http://github.com/ContinuumIO/elm-examples (used for py.test)
- ELM_LOGGING_LEVEL: Either INFO (default) or DEBUG
- ELM_PREDICT_PATH: Base path for saving prediction output
- ELM_TRAIN_PATH: Base path for saving trained ensembles
- MAX_PARAM_RETRIES: How many times to retry in the genetic algorithm when parameters are repeatedly infeasible
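For illustration only (not elm internals), these settings can be inspected from Python with the documented defaults:

import os

dask_executor = os.environ.get('DASK_EXECUTOR', 'SERIAL')
dask_scheduler = os.environ.get('DASK_SCHEDULER')      # e.g. '10.0.0.10:8786' when DISTRIBUTED
logging_level = os.environ.get('ELM_LOGGING_LEVEL', 'INFO')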
Testing elm
py.test Unit Tests¶
These testing instructions assume you have cloned the elm
repository locally and installed from source.
Note: Many tests are skipped if you have not defined the environment variable ELM_EXAMPLE_DATA_PATH
(referring to your local clone of elm-examples)
Run the faster running tests:
py.test -m "not slow"
Running all tests:
py.test
or get the verbose test output
py.test -v
and cut and paste a test name to run a specific test:
py.test -k test_bad_train_config
When running py.test, the environment variables related to dask determine whether a dask-distributed client, a thread pool client, or serial evaluation is used (see also dask-distributed).
Longer Running Tests¶
The elm-run-all-tests
console entry point can automate running of some or all python scripts and yaml
elm-main
config files in elm-examples
and/or the py.test
unit tests.
Here is an example that is run from inside the cloned elm
repository with elm-examples
cloned in the current directory (see the first two arguments: ./ - cloned elm
repo and ./elm-examples
- the location of cloned elm-examples
)
ELM_LOGGING_LEVEL=DEBUG elm-run-all-tests ./ elm-examples/ --skip-pytest --skip-scripts --dask-clients SERIAL DISTRIBUTED --dask-scheduler 10.0.0.10:8786
Above, the arguments --skip-scripts and --skip-pytest refer to skipping the scripts in elm-examples and the py.test unit tests in elm, respectively, so the command above will run all the configs in elm-examples/configs once for serial evaluation and once with dask-distributed.
Here is the full help on elm-run-all-tests
entry point:
$ elm-run-all-tests --help
usage: elm-run-all-tests [-h] [--pytest-mark PYTEST_MARK]
[--dask-clients {ALL,SERIAL,DISTRIBUTED,THREAD_POOL} [{ALL,SERIAL,DISTRIBUTED,THREAD_POOL} ...]]
[--dask-scheduler DASK_SCHEDULER] [--skip-pytest]
[--skip-scripts] [--skip-configs]
[--add-large-test-settings]
[--glob-pattern GLOB_PATTERN]
[--remote-git-branch REMOTE_GIT_BRANCH]
repo_dir elm_examples_path
Run longer-running tests of elm
positional arguments:
repo_dir Directory that is the top dir of cloned elm repo
elm_examples_path Path to a directory which contains subdirectories
"scripts", "scripts", and "example_data" with yaml-
configs, python-scripts, and example data,
respectively
optional arguments:
-h, --help show this help message and exit
--pytest-mark PYTEST_MARK
Mark to pass to py.test -m (marker of unit tests)
--dask-clients {ALL,SERIAL,DISTRIBUTED,THREAD_POOL} [{ALL,SERIAL,DISTRIBUTED,THREAD_POOL} ...]
Dask client(s) to test: ['ALL', 'SERIAL',
'DISTRIBUTED', 'THREAD_POOL']
--dask-scheduler DASK_SCHEDULER
Dask scheduler URL
--skip-pytest Do not run py.test (default is run py.test as well as
configs)
--skip-scripts Do not run scripts from elm-examples
--skip-configs Do not run configs from elm-examples
--add-large-test-settings
Adjust configs for larger ensembles / param_grids
--glob-pattern GLOB_PATTERN
Glob within repo_dir
--remote-git-branch REMOTE_GIT_BRANCH
Run on a remote git branch
Release Procedure¶
- Ensure all tests pass.
- Tag commit and push to github
git tag -a x.x.x -m 'Version x.x.x'
git push upstream master --tags
- Build conda packages
Define platform/setup specific environment variables (fill in with your specifics)
# Location of your conda install. For me it's `~/anaconda/`
CONDA_DIR=~/anaconda/
# Platform code. For me it's `osx-64`
PLATFORM=osx-64
# Version number of elm being released (e.g. 0.2.0)
VERSION=0.2.0
# requires conda-build (conda install conda-build)
conda build conda.recipe/ --python 3.5 --no-anaconda-upload -c conda-forge
Next, cd into the folder where the builds end up.
cd $CONDA_DIR/conda-bld/$PLATFORM
Use conda convert
to convert over the missing platforms (skipping the one for
the platform you’re currently on):
conda convert --platform osx-64 elm-$VERSION*.tar.bz2 -o ../
conda convert --platform linux-64 elm-$VERSION*.tar.bz2 -o ../
conda convert --platform linux-32 elm-$VERSION*.tar.bz2 -o ../
conda convert --platform win-64 elm-$VERSION*.tar.bz2 -o ../
conda convert --platform win-32 elm-$VERSION*.tar.bz2 -o ../
Use anaconda upload
to upload the build to the elm
channel. This requires
you to be setup on anaconda.org, and have the proper credentials to push to
the elm
channel.
# requires anaconda-client (conda install anaconda-client)
anaconda login
anaconda upload $CONDA_DIR/conda-bld/*/elm-$VERSION*.tar.bz2 -u elm
- Repeat the conda build and anaconda upload steps above for --python 3.4 as well
- Write the release notes:
  - Run git log to get a listing of all the changes
  - Remove any covered in the previous release
  - Summarize the rest to focus on user-visible changes and major new features
  - Paste the notes into github, under Releases, then Tags, then Edit release notes.