pyls: Partial Least Squares in Python¶
This package provides a Python interface for performing partial least squares (PLS) analyses.
Installation requirements¶
Currently, pyls works with Python 3.5+ and requires a few dependencies:
h5py
numpy
scikit-learn
scipy
tqdm
Assuming you have the correct version of Python installed, you can install pyls by opening a terminal and running the following:
git clone https://github.com/rmarkello/pyls.git
cd pyls
python setup.py install
All relevant dependencies will be installed alongside the pyls module.
Quickstart¶
There are a number of ways to use pyls, depending on the type of analysis you would like to perform. Assuming you have two matrices X and Y representing different observations from a set of samples (i.e., subjects, neurons, brain regions), you can run a simple analysis with:
>>> import pyls
>>> results = pyls.behavioral_pls(X, Y)
For detailed information on the different methods available and how to interpret the results object, please refer to our user guide.
Development and getting involved¶
If you’ve found a bug, are experiencing a problem, or have a question about using the package, please head on over to our GitHub issues and make a new issue with some information about it! Someone will try and get back to you as quickly as possible, though please note that the primary developer for pyls (@rmarkello) is a graduate student, so responses may take some time!
If you’re interested in getting involved in the project: welcome ✨!
We’re thrilled to welcome new contributors. You should start by reading our code of conduct; all activity on pyls should adhere to the CoC. After that, take a look at our contributing guidelines so you’re familiar with the processes we (generally) try to follow when making changes to the repository! Once you’re ready to jump in, head on over to our issues to see if there’s anything you might like to work on.
License Information¶
This codebase is licensed under the GNU General Public License, version 2. The full license can be found in the LICENSE file in the pyls distribution.
All trademarks referenced herein are property of their respective holders.
User guide¶
Partial least squares (PLS) is a multivariate statistical technique that aims to find shared information between two sets of variables. If you’re unfamiliar with PLS and are interested in a thorough (albeit quite technical) treatment, Abdi et al., 2013 is a good resource.
This user guide will go through the basic statistical concepts of the two types of PLS implemented in the current package (Behavioral PLS and Mean-centered PLS) and demonstrate how to interpret and use the results of a PLS analysis (PLS Results). If you still have questions after going through this guide, you can refer to the Reference API!
Behavioral PLS¶
Running a behavioral PLS using pyls is as simple as:
>>> import pyls
>>> out = pyls.behavioral_pls(X, Y)
What we call behavioral PLS in the pyls package is actually the more traditional form of PLS (and is generally not prefixed with “behavioral”). This form of PLS, at its core, attempts to find shared information between two sets of features derived from a common set of samples. However, as with all things, there are a number of ever-so-slightly different kinds of PLS that exist in the wild, so to be thorough we’re going to briefly explain the exact flavor implemented here before diving into a more illustrative example.
What exactly do we mean by “behavioral PLS”?¶
Technical answer: pyls.behavioral_pls() employs a symmetrical, singular value decomposition (SVD) based form of PLS, and is sometimes referred to as PLS-correlation (PLS-C), PLS-SVD, or, infrequently, EZ-PLS. Notably, it is not the same as PLS regression (PLS-R).
Less technical answer: pyls.behavioral_pls() is like performing a principal components analysis (PCA) but when you have two related datasets, each with multiple features.
Differences from PLS regression (PLS-R)¶
You can think of the differences between PLS-C and PLS-R similar to how you might consider the differences between a Pearson correlation and a simple linear regression. Though this analogy is an over-simplification, the primary difference to take away is that behavioral PLS (PLS-C) does not assess directional relationships between sets of data (e.g., X → Y), but rather looks at how the two sets generally covary (e.g., X ↔ Y).
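As a toy illustration of that symmetry (hypothetical random data; this snippet doesn’t use the pyls API at all): the Pearson correlation is identical whichever way you order the two variables, whereas a regression slope generally is not.
>>> import numpy as np
>>> from scipy.stats import pearsonr
>>> rng = np.random.default_rng(0)  # hypothetical toy data
>>> x, y = rng.standard_normal(20), rng.standard_normal(20)
>>> pearsonr(x, y)[0] == pearsonr(y, x)[0]  # correlation is symmetric (x <-> y)
True
>>> np.polyfit(x, y, 1)[0] == np.polyfit(y, x, 1)[0]  # regression is directional (x -> y vs. y -> x)
False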
To understand this a bit more we can walk through a detailed example.
An exercise in calisthenics¶
Note
Descriptions of PLS are almost always accompanied by a litany of equations, and for good reason: understanding how to interpret the results of a PLS requires at least a cursory understanding of the math behind it. As such, this example is going to rely on these equations, but will always do so in the context of real data. The hope is that this approach will help make the more abstract mathematical concepts a bit more concrete (and easier to apply to new data sets!).
We’ll start by loading the example dataset:
>>> from pyls.examples import load_dataset
>>> data = load_dataset('linnerud')
This is the same dataset as in sklearn.datasets.load_linnerud(); the formatting has just been lightly modified to better suit our purposes.
Our data object can be treated as a dictionary, containing all the information necessary to run a PLS analysis. The keys can be accessed as attributes, so we can take a quick look at our input matrices \(\textbf{X}\) and \(\textbf{Y}\):
>>> sorted(data.keys())
['X', 'Y', 'n_boot', 'n_perm']
>>> data.X.shape
(20, 3)
>>> data.X.head()
Chins Situps Jumps
0 5.0 162.0 60.0
1 2.0 110.0 60.0
2 12.0 101.0 101.0
3 12.0 105.0 37.0
4 13.0 155.0 58.0
The rows of our \(\textbf{X}_{n \times p}\) matrix here represent n subjects, and the columns indicate p different types of exercises these subjects were able to perform. So the first subject was able to do 5 chin-ups, 162 situps, and 60 jumping jacks.
>>> data.Y.shape
(20, 3)
>>> data.Y.head()
Weight Waist Pulse
0 191.0 36.0 50.0
1 189.0 37.0 52.0
2 193.0 38.0 58.0
3 162.0 35.0 62.0
4 189.0 35.0 46.0
The rows of our \(\textbf{Y}_{n \times q}\) matrix also represent n subjects (critically, the same subjects as in \(\textbf{X}\)), and the columns indicate q physiological measurements taken for each subject. That same subject referenced above thus has a weight of 191 pounds, a 36 inch waist, and a pulse of 50 beats per minute.
Behavioral PLS will attempt to establish whether a relationship exists between the exercises performed and these physiological variables. If we wanted to run the full analysis right away, we could do so with:
>>> from pyls import behavioral_pls
>>> results = behavioral_pls(**data)
If you’re comfortable with the down-and-dirty of PLS and want to go ahead and start understanding the results object, feel free to jump ahead to PLS Results. Otherwise, read on for more about what’s happening behind the scenes of behavioral_pls().
The cross-covariance matrix¶
Behavioral PLS works by decomposing the cross-covariance matrix
\(\textbf{R}_{q \times p}\) generated from the input matrices, where
\(\textbf{R} = \textbf{Y}^{T} \textbf{X}\). The results of PLS are a
bit easier to interpret when \(\textbf{R}\) is the cross-correlation matrix
instead of the cross-covariance matrix, which means that we should z-score each
feature in \(\textbf{X}\) and \(\textbf{Y}\) before multiplying them;
this is done automatically by the behavioral_pls() function.
In our example, \(\textbf{R}\) ends up being a 3 x 3 matrix:
>>> from pyls.compute import xcorr
>>> R = xcorr(data.X, data.Y)
>>> R
Chins Situps Jumps
Weight -0.389694 -0.493084 -0.226296
Waist -0.552232 -0.645598 -0.191499
Pulse 0.150648 0.225038 0.034933
The \(q\) rows of this matrix correspond to the physiological measurements and the \(p\) columns to the exercises. Examining the first row, we can see that -0.389694 is the correlation between Weight and Chins across all the subjects, -0.493084 the correlation between Weight and Situps, and so on.
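For the curious, this matrix can (we believe) be reproduced by hand: z-score each column of the inputs, then form the scaled product \(\textbf{Y}^{T}\textbf{X}\) as described above. A minimal sketch, assuming xcorr() normalizes with ddof=1:
>>> import numpy as np
>>> from scipy.stats import zscore
>>> Xz, Yz = zscore(data.X, ddof=1), zscore(data.Y, ddof=1)  # z-score each feature
>>> R_manual = Yz.T @ Xz / (len(Xz) - 1)  # q x p cross-correlation matrix
>>> np.allclose(R_manual, R)
True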
Singular value decomposition¶
Once we have generated our correlation matrix \(\textbf{R}\) we subject it to a singular value decomposition, where \(\textbf{R} = \textbf{USV}^{T}\):
>>> from pyls.compute import svd
>>> U, S, V = svd(R)
>>> U.shape, S.shape, V.shape
((3, 3), (3, 3), (3, 3))
The outputs of this decomposition are two arrays of left and right singular vectors (\(\textbf{U}_{p \times l}\) and \(\textbf{V}_{q \times l}\)) and a diagonal matrix of singular values (\(\textbf{S}_{l \times l}\)). The rows of \(\textbf{U}\) correspond to the exercises from our input matrix \(\textbf{X}\), and the rows of \(\textbf{V}\) correspond to the physiological measurements from our input matrix \(\textbf{Y}\). The columns of \(\textbf{U}\) and \(\textbf{V}\), on the other hand, represent new dimensions or components that have been “discovered” in the data.
The \(i^{th}\) columns of \(\textbf{U}\) and \(\textbf{V}\) weigh the contributions of these exercises and physiological measurements, respectively. Taken together, the \(i^{th}\) left and right singular vectors and singular value represent a latent variable, a multivariate pattern that weighs the original exercise and physiological measurements such that they maximally covary with each other.
The \(i^{th}\) singular value is proportional to the total exercise-physiology covariance accounted for by the latent variable. The effect size (\(\eta\)) associated with a particular latent variable can be estimated as the ratio of the squared singular value (\(\sigma\)) to the sum of all the squared singular values:

\[\eta_{i} = \frac{\sigma_{i}^{2}}{\sum_{j=1}^{l} \sigma_{j}^{2}}\]
We can use the helper function pyls.compute.varexp() to calculate this for us:
>>> from pyls.compute import varexp
>>> pctvar = varexp(S)[0, 0]
>>> print('{:.4f}'.format(pctvar))
0.9947
Taking a look at the variance explained, we see that a whopping ~99.5% of the covariance between the exercises and physiological measurements in \(\textbf{X}\) and \(\textbf{Y}\) is explained by this latent variable, suggesting that the relationship between these variables can be effectively explained by a single dimension.
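To make the equation above concrete, the same number can be computed by hand from the singular values (assuming varexp() implements exactly the ratio shown earlier):
>>> import numpy as np
>>> sigma = np.diag(S)  # extract singular values from the diagonal matrix
>>> print('{:.4f}'.format((sigma ** 2 / np.sum(sigma ** 2))[0]))
0.9947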
Examining the weights from the singular vectors:
>>> U[:, 0]
array([0.61330742, 0.7469717 , 0.25668519])
>>> V[:, 0]
array([-0.58989118, -0.77134059, 0.23887675])
we see that all the exercises (U[:, 0]) are positively weighted, but that the physiological measurements (V[:, 0]) are split, with the Weight and Waist measurements negatively weighted and Pulse positively weighted. (Note that the order of the weights is the same as the order of the original columns in our \(\textbf{X}\) and \(\textbf{Y}\) matrices.) Taken together, this suggests that, for the subjects in this dataset, individuals who completed more of a given exercise tended to:
Complete more of the other exercises, and
Have a lower weight, smaller waist, and higher heart rate.
It is also worth examining how correlated the projections of the original variables on this latent variable are. To do that, we can multiply the original data matrices by the relevant singular vectors and then correlate the results:
>>> import numpy as np
>>> from scipy.stats import pearsonr
>>> XU = np.dot(data.X, U)
>>> YV = np.dot(data.Y, V)
>>> r, p = pearsonr(XU[:, 0], YV[:, 0])
>>> print('r = {:.4f}, p = {:.4f}'.format(r, p))
r = 0.4900, p = 0.0283
The correlation value of this latent variable (~0.49) suggests that our interpretation of the singular vector weights, above, is only somewhat accurate. We can think of this correlation (ranging from -1 to 1) as a proxy for the question: “how often is this interpretation of the singular vectors true?” Correlations closer to -1 indicate that the interpretation is largely inaccurate across subjects, whereas correlations closer to 1 indicate the interpretation is largely accurate across subjects.
Latent variable significance testing¶
Scientists love null-hypothesis significance testing, so there’s a strong urge for researchers doing these sorts of analyses to want to find a way to determine whether observed latent variables are significant(ly different from a specified null model). The issue comes in determining what aspect of the latent variables to test!
With behavioral PLS we assess whether the variance explained by a given latent variable is significantly different than would be expected by a null. Importantly, that null is generated by re-computing the latent variables from random permutations of the original data, generating a non-parametric distribution of explained variances by which to measure “significance.”
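A rough sketch of that permutation procedure follows. This is conceptual only (the actual pyls implementation additionally aligns the permuted singular vectors before comparison, and rng, null, and pvals are our own names), and it assumes xcorr() and svd() accept plain arrays:
>>> import numpy as np
>>> from pyls.compute import svd, xcorr
>>> X, Y = np.asarray(data.X), np.asarray(data.Y)
>>> observed = np.diag(svd(xcorr(X, Y))[1])  # observed singular values
>>> rng = np.random.default_rng(1234)  # hypothetical seed
>>> null = np.zeros((1000, len(observed)))
>>> for i in range(1000):
...     perm = rng.permutation(len(Y))  # shuffle subjects in Y only
...     null[i] = np.diag(svd(xcorr(X, Y[perm]))[1])  # singular values as a simple proxy for variance explained
>>> pvals = (null >= observed).mean(axis=0)  # non-parametric p-values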
Mean-centered PLS¶
In contrast to behavioral PLS, mean-centered PLS doesn’t aim to find relationships between two sets of variables. Instead, it tries to find relationships between groupings in a single set of variables. Indeed, you can think of it almost like a multivariate t-test or ANOVA (depending on how many groups you have).
An oenological example¶
>>> from pyls.examples import load_dataset
>>> data = load_dataset('wine')
This is the same dataset as in sklearn.datasets.load_wine(); the formatting has just been lightly modified to better suit our purposes.
Our data object can be treated as a dictionary, containing all the information necessary to run a PLS analysis. The keys can be accessed as attributes, so we can take a quick look at our input matrix:
>>> sorted(data.keys())
['X', 'groups', 'n_boot', 'n_perm']
>>> data.X.shape
(178, 13)
>>> data.X.columns
Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
'proanthocyanins', 'color_intensity', 'hue',
'od280/od315_of_diluted_wines', 'proline'],
dtype='object')
>>> data.groups
[59, 71, 48]
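From here, running the analysis mirrors the behavioral PLS call. A sketch, assuming meancentered_pls() accepts the dataset dictionary and a verbose flag the same way behavioral_pls() does:
>>> from pyls import meancentered_pls
>>> results = meancentered_pls(**data, verbose=False)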
PLS Results¶
So you ran a PLS analysis and got some results. Congratulations! The easy part is done. 🙃 Interpreting (trying to interpret) the results of a PLS analysis—similar to interpreting the results of a PCA or factor analysis or CCA or any other complex decomposition—can be difficult. The pyls package contains some functions, tools, and data structures to try and help.
The PLSResults data structure is, at its core, a Python dictionary that is designed to contain all possible results from any of the analyses available in pyls.types. Let’s generate a small example results object to play around with. We’ll use the dataset from the Behavioral PLS example:
>>> from pyls.examples import load_dataset
>>> data = load_dataset('linnerud')
We can generate the results object by running the behavioral PLS analysis again. We pass the verbose=False flag to suppress the progress bar that would normally be displayed:
>>> from pyls import behavioral_pls
>>> results = behavioral_pls(**data, verbose=False)
>>> results
PLSResults(x_weights, y_weights, x_scores, y_scores, y_loadings, singvals, varexp, permres, bootres, cvres, inputs)
Printing the results object gives us a helpful view of some of the different outputs available to us. While we won’t go into detail about all of these (see the Reference API for info on those), we’ll touch on a few of the potentially more confusing ones.
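Because the object is dictionary-like, each of those outputs can be pulled out either as an attribute or as a key; for example:
>>> varexp = results.varexp  # attribute-style access
>>> varexp = results['varexp']  # equivalent dictionary-style access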
Reference API¶
This is the primary reference of pyls. Please refer to the user guide for more information on how to best implement these functions in your own workflows.
List of modules
pyls.structures - PLS data structures
pyls.io - Data I/O functionality
pyls.matlab - Matlab compatibility
pyls - PLS decompositions¶
The primary PLS decomposition methods for use in conducting PLS analyses:
behavioral_pls() - Performs behavioral PLS on X and Y.
meancentered_pls() - Performs mean-centered PLS on X, sorted into groups and conditions.
pyls.structures - PLS data structures¶
Data structures to hold PLS inputs and results objects:
PLSResults - Dictionary-like object containing results of PLS analysis
PLSPermResults - Dictionary-like object containing results of PLS permutation testing
PLSBootResults - Dictionary-like object containing results of PLS bootstrap resampling
PLSSplitHalfResults - Dictionary-like object containing results of PLS split-half resampling
PLSCrossValidationResults - Dictionary-like object containing results of PLS cross-validation testing
PLSInputs - PLS input information
pyls.io - Data I/O functionality¶
Functions for saving and loading PLS data objects:
save_results() - Saves PLS results to hdf5 file fname
load_results() - Loads PLS results stored in fname, generated by pyls.save_results()
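A minimal usage sketch (assuming the signatures match the descriptions above; results.hdf5 is a hypothetical filename):
>>> import pyls
>>> pyls.save_results('results.hdf5', results)  # write results to an hdf5 file
>>> results = pyls.load_results('results.hdf5')  # load them back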
pyls.matlab - Matlab compatibility¶
Utilities for handling PLS results generated using the Matlab PLS toolbox:
import_matlab_result() - Imports fname PLS result from Matlab
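Usage is similarly direct (a sketch assuming the function is also exported at the package level; myresult.mat is a hypothetical file created with the Matlab PLS toolbox):
>>> import pyls
>>> matres = pyls.import_matlab_result('myresult.mat')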