Welcome to MultivariateStats’s documentation¶

MultivariateStats.jl is a Julia package for multivariate statistical analysis. It provides a rich set of useful analysis techniques, such as PCA, CCA, LDA, PLS, etc.


Linear Least Square and Ridge Regression¶

The package provides functions to perform Linear Least Square and Ridge Regression.

Linear Least Square¶

Linear least square seeks linear combination(s) of given variables that fit the responses by minimizing the squared error between them. This can be formulated as the following optimization problem:

\[ \underset{(\mathbf{a}, b)}{\operatorname{minimize}} \ \frac{1}{2} \|\mathbf{y} - (\mathbf{X} \mathbf{a} + b)\|^2 \]

Sometimes, the coefficient matrix is given in a transposed form, in which case the optimization is modified as:

\[ \underset{(\mathbf{a}, b)}{\operatorname{minimize}} \ \frac{1}{2} \|\mathbf{y} - (\mathbf{X}^T \mathbf{a} + b)\|^2 \]

The package provides llsq to solve these problems:

llsq(X, y; ...)

Solve the linear least square problem formulated above.

Here, y can be either a vector, or a matrix where each column is a response vector.

This function accepts two keyword arguments:

• trans: whether to use the transposed form. (default is false)
• bias: whether to include the bias term b. (default is true)

The function returns the solution a. In particular, when y is a vector (matrix), a is also a vector (matrix). If bias is true, then the returned array is augmented as [a; b].

Examples

For a single response vector y (without using bias):

using MultivariateStats, Statistics   # Statistics provides mean

# prepare data
X = rand(1000, 3)               # feature matrix
a0 = rand(3)                    # ground truths
y = X * a0 + 0.1 * randn(1000)  # generate response

# solve using llsq
a = llsq(X, y; bias=false)

# do prediction
yp = X * a

# measure the error
rmse = sqrt(mean(abs2.(y - yp)))
println("rmse = $rmse")

For a single response vector y (using bias):

# prepare data
X = rand(1000, 3)
a0, b0 = rand(3), rand()
y = X * a0 + b0 + 0.1 * randn(1000)

# solve using llsq
sol = llsq(X, y)

# extract results
a, b = sol[1:end-1], sol[end]

# do prediction
yp = X * a .+ b

For a matrix Y comprised of multiple columns:

# prepare data
X = rand(1000, 3)
A0, b0 = rand(3, 5), rand(1, 5)
Y = (X * A0 .+ b0) + 0.1 * randn(1000, 5)

# solve using llsq
sol = llsq(X, Y)

# extract results
A, b = sol[1:end-1,:], sol[end,:]

# do prediction
Yp = X * A .+ b'

Ridge Regression¶

Compared to linear least square, Ridge Regression uses an additional quadratic term to regularize the problem:

\[ \underset{(\mathbf{a}, b)}{\operatorname{minimize}} \ \frac{1}{2} \|\mathbf{y} - (\mathbf{X} \mathbf{a} + b)\|^2 + \frac{1}{2} \mathbf{a}^T \mathbf{Q} \mathbf{a} \]

The transposed form:

\[ \underset{(\mathbf{a}, b)}{\operatorname{minimize}} \ \frac{1}{2} \|\mathbf{y} - (\mathbf{X}^T \mathbf{a} + b)\|^2 + \frac{1}{2} \mathbf{a}^T \mathbf{Q} \mathbf{a} \]

The package provides ridge to solve these problems:

ridge(X, y, r; ...)

Solve the ridge regression problem formulated above.

Here, y can be either a vector, or a matrix where each column is a response vector.

The argument r gives the quadratic regularization matrix Q, which can be in either of the following forms:

• r is a real scalar, then Q is considered to be r * eye(n), where n is the dimension of a.
• r is a real vector, then Q is considered to be diagm(r).
• r is a real symmetric matrix, then Q is simply considered to be r.

This function accepts two keyword arguments:

• trans: whether to use the transposed form. (default is false)
• bias: whether to include the bias term b. (default is true)

The function returns the solution a. In particular, when y is a vector (matrix), a is also a vector (matrix). If bias is true, then the returned array is augmented as [a; b].
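Usage mirrors llsq. As a minimal sketch with a scalar regularization coefficient (the data below is synthetic, reusing the setup from the llsq examples above):

```julia
using MultivariateStats

# prepare data (same synthetic setup as the llsq examples)
X = rand(1000, 3)
a0, b0 = rand(3), rand()
y = X * a0 .+ b0 + 0.1 * randn(1000)

# solve with a scalar regularization coefficient r = 0.1,
# i.e. Q = 0.1 * eye(3)
sol = ridge(X, y, 0.1)

# bias defaults to true, so the solution is augmented as [a; b]
a, b = sol[1:end-1], sol[end]

# do prediction
yp = X * a .+ b
```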

Data Whitening¶

A whitening transformation is a decorrelation transformation that transforms a set of random variables into a set of new random variables with identity covariance (uncorrelated with unit variances).

In particular, suppose a random vector \(\mathbf{x}\) has covariance \(\mathbf{C}\), then a whitening transform \(\mathbf{W}\) is one that satisfies:

\[ \mathbf{W}^T \mathbf{C} \mathbf{W} = \mathbf{I} \]

Note that \(\mathbf{W}\) is generally not unique. In particular, if \(\mathbf{W}\) is a whitening transform, so is any rotation of it, \(\mathbf{W} \mathbf{R}\), with \(\mathbf{R}^T \mathbf{R} = \mathbf{I}\).

Whitening¶

The package uses Whitening defined below to represent a whitening transform:

struct Whitening{T<:AbstractFloat}
    mean::Vector{T}     # mean vector (can be empty to indicate zero mean), of length d
    W::Matrix{T}        # the transform coefficient matrix, of size (d, d)
end

An instance of Whitening can be constructed by Whitening(mean, W).

There are several functions to access the properties of a whitening transform f:

indim(f)

Get the input dimension, i.e d.

outdim(f)

Get the output dimension, i.e. d.

mean(f)

Get the mean vector.

Note: if f.mean is empty, this function returns a zero vector of length d.

transform(f, x)

Apply the whitening transform to a vector or a matrix with samples in columns, as \(\mathbf{W}^T (\mathbf{x} - \boldsymbol{\mu})\).

Data Analysis¶

Given a dataset, one can use the fit method to estimate a whitening transform.

fit(Whitening, X; ...)

Estimate a whitening transform from the data given in X. Here, X should be a matrix, whose columns give the samples.

This function returns an instance of Whitening.

Keyword Arguments:

• regcoef (default: zero(T)): The regularization coefficient. The covariance will be regularized as follows when regcoef is positive: C + (eigmax(C) * regcoef) * eye(d)
• mean (default: nothing): The mean vector, which can be either of:
  • 0: the input data has already been centralized
  • nothing: this function will compute the mean
  • a pre-computed mean vector
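As a sketch of the whole workflow on synthetic, correlated data: after fitting and applying the transform as described above, the whitened samples should have an (approximately) identity covariance.

```julia
using MultivariateStats, Statistics, LinearAlgebra

# synthetic data with correlated dimensions; samples in columns
A = randn(5, 5)
X = A * randn(5, 1000)

f = fit(Whitening, X)
Z = transform(f, X)

# the sample covariance of the whitened data is approximately the identity
C = cov(Z, dims=2)
```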

Note: This function internally relies on cov_whitening to derive the transformation W. The function cov_whitening is itself also useful on its own.

cov_whitening(C)

Derive the whitening transform coefficient matrix W given the covariance matrix C. Here, C can be either a square matrix, or an instance of Cholesky.

Internally, this function solves the whitening transform using Cholesky factorization. The rationale is as follows: let \(\mathbf{C} = \mathbf{U}^T \mathbf{U}\) and \(\mathbf{W} = \mathbf{U}^{-1}\), then \(\mathbf{W}^T \mathbf{C} \mathbf{W} = \mathbf{I}\).

Note: The returned matrix W is an upper triangular matrix.

cov_whitening(C, regcoef)

Derive a whitening transform based on a regularized covariance, as C + (eigmax(C) * regcoef) * eye(d).

In addition, the package also provides cov_whitening!, in which the input matrix C will be overwritten during computation. This can be more efficient when C is no longer needed afterwards.
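A minimal sketch of cov_whitening on a hand-picked covariance (the values below are arbitrary):

```julia
using MultivariateStats, LinearAlgebra

C = [4.0 1.0;
     1.0 3.0]          # a symmetric positive-definite covariance

W = cov_whitening(C)

# W is upper triangular and satisfies W' * C * W ≈ I
istriu(W)
W' * C * W
```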

invsqrtm(C)

Compute inv(sqrtm(C)) through symmetric eigenvalue decomposition.
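A small sketch: since invsqrtm(C) computes B = C^(-1/2), the product B * C * B recovers the identity (the covariance below is an arbitrary symmetric positive-definite matrix):

```julia
using MultivariateStats, LinearAlgebra

C = [4.0 1.0;
     1.0 3.0]

B = invsqrtm(C)

# B = C^(-1/2), hence B * C * B ≈ I
B * C * B
```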

Principal Component Analysis¶

Principal Component Analysis (PCA) derives an orthogonal projection to convert a given set of observations to linearly uncorrelated variables, called principal components.

This package defines a PCA type to represent a PCA model, and provides a set of methods to access the properties.

Properties¶

Let M be an instance of PCA, d be the dimension of observations, and p be the output dimension (i.e. the dimension of the principal subspace).

indim(M)

Get the input dimension d, i.e the dimension of the observation space.

outdim(M)

Get the output dimension p, i.e the dimension of the principal subspace.

mean(M)

Get the mean vector (of length d).

projection(M)

Get the projection matrix (of size (d, p)). Each column of the projection matrix corresponds to a principal component.

The principal components are arranged in descending order of the corresponding variances.

principalvars(M)

The variances of principal components.

tprincipalvar(M)

The total variance of principal components, which is equal to sum(principalvars(M)).

tresidualvar(M)

The total residual variance.

tvar(M)

The total observation variance, which is equal to tprincipalvar(M) + tresidualvar(M).

principalratio(M)

The ratio of variance preserved in the principal subspace, which is equal to tprincipalvar(M) / tvar(M).
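The identities above can be checked directly. A minimal sketch on synthetic data (the stated relations follow from the definitions in this section):

```julia
using MultivariateStats

X = randn(5, 200)               # d = 5 dimensions, n = 200 observations
M = fit(PCA, X; maxoutdim=3)

P = projection(M)               # (d, p) matrix; columns are principal components
v = principalvars(M)            # variances, in descending order

# identities from the definitions above
tprincipalvar(M) ≈ sum(v)
tvar(M) ≈ tprincipalvar(M) + tresidualvar(M)
principalratio(M) ≈ tprincipalvar(M) / tvar(M)
```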

Transformation and Construction¶

Given a PCA model M, one can use it to transform observations into principal components, as

\[ \mathbf{y} = \mathbf{P}^T (\mathbf{x} - \boldsymbol{\mu}) \]

or use it to reconstruct (approximately) the observations from principal components, as

\[ \tilde{\mathbf{x}} = \mathbf{P} \mathbf{y} + \boldsymbol{\mu} \]

Here, \(\mathbf{P}\) is the projection matrix.

The package provides methods to do so:

transform(M, x)

Transform observations x into principal components.

Here, x can be either a vector of length d or a matrix where each column is an observation.

reconstruct(M, y)

Approximately reconstruct observations from the principal components given in y.

Here, y can be either a vector of length p or a matrix where each column gives the principal components for an observation.

Data Analysis¶

One can use the fit method to perform PCA over a given dataset.

fit(PCA, X; ...)

Perform PCA over the data given in a matrix X. Each column of X is an observation.

This method returns an instance of PCA.

Keyword arguments:

Let (d, n) = size(X) be respectively the input dimension and the number of observations:

• method (default: :auto): The choice of methods:
  • :auto: use :cov when d < n or :svd otherwise
  • :cov: based on the covariance matrix
  • :svd: based on the SVD of the input data
• maxoutdim (default: min(d, n)): Maximum output dimension.
• pratio (default: 0.99): The ratio of variances preserved in the principal subspace.
• mean (default: nothing): The mean vector, which can be either of:
  • 0: the input data has already been centralized
  • nothing: this function will compute the mean
  • a pre-computed mean vector

Notes:

• The output dimension p depends on both maxoutdim and pratio, as follows. Suppose the first k principal components preserve at least pratio of the total variance, while the first k-1 preserves less than pratio, then the actual output dimension will be min(k, maxoutdim).
• This function calls pcacov or pcasvd internally, depending on the choice of method.

Example:

using MultivariateStats

# suppose Xtr and Xte are training and testing data matrix,
# with each observation in a column

# train a PCA model
M = fit(PCA, Xtr; maxoutdim=100)

# apply PCA model to testing set
Yte = transform(M, Xte)

# reconstruct testing observations (approximately)
Xr = reconstruct(M, Yte)

Example with iris dataset and plotting:

using MultivariateStats, RDatasets, Plots
plotly() # use plotly for 3D interactive graphing

iris = dataset("datasets", "iris")

# split half to training set
Xtr = Matrix(iris[1:2:end, 1:4])'
Xtr_labels = Vector(iris[1:2:end, 5])

# split other half to testing set
Xte = Matrix(iris[2:2:end, 1:4])'
Xte_labels = Vector(iris[2:2:end, 5])

# suppose Xtr and Xte are training and testing data matrix,
# with each observation in a column

# train a PCA model, allowing up to 3 dimensions
M = fit(PCA, Xtr; maxoutdim=3)

# apply PCA model to testing set
Yte = transform(M, Xte)

# reconstruct testing observations (approximately)
Xr = reconstruct(M, Yte)

# group results by testing set labels for color coding
setosa = Yte[:,Xte_labels.=="setosa"]
versicolor = Yte[:,Xte_labels.=="versicolor"]
virginica = Yte[:,Xte_labels.=="virginica"]

# visualize first 3 principal components in a 3D interactive plot
p = scatter(setosa[1,:],setosa[2,:],setosa[3,:],marker=:circle,linewidth=0)
scatter!(versicolor[1,:],versicolor[2,:],versicolor[3,:],marker=:circle,linewidth=0)
scatter!(virginica[1,:],virginica[2,:],virginica[3,:],marker=:circle,linewidth=0)
plot!(p,xlabel="PC1",ylabel="PC2",zlabel="PC3")

Core Algorithms¶

Two algorithms are implemented in this package: pcacov and pcasvd.

pcacov(C, mean; ...)

Compute PCA based on eigenvalue decomposition of a given covariance matrix C.

Parameters:

• C – The covariance matrix.
• mean – The mean vector of the original samples, which can be a vector of length d, or an empty vector Float64[] indicating a zero mean.

Returns: The resultant PCA model.

This function accepts two keyword arguments: maxoutdim and pratio.
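A minimal sketch of calling pcacov directly, with the covariance and mean computed via the Statistics standard library (synthetic data):

```julia
using MultivariateStats, Statistics

X = randn(4, 100)               # observations in columns
mu = vec(mean(X, dims=2))
C = cov(X, dims=2)

M = pcacov(C, mu; maxoutdim=2)
projection(M)                   # a (4, 2) projection matrix
```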
pcasvd(Z, mean, tw; ...)

Compute PCA based on singular value decomposition of a centralized sample matrix Z.

Parameters:

• Z – provides centralized samples.
• mean – The mean vector of the original samples, which can be a vector of length d, or an empty vector Float64[] indicating a zero mean.

Returns: The resultant PCA model.

This function accepts two keyword arguments: maxoutdim and pratio.
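A sketch of pcasvd on synthetic data. The tw argument is not described above; the call below assumes it is the total sample weight, i.e. the number of observations n when samples are unweighted:

```julia
using MultivariateStats, Statistics

X = randn(4, 100)
mu = vec(mean(X, dims=2))
Z = X .- mu                     # centralized samples

# assumption: tw = n, the total weight of unweighted samples
M = pcasvd(Z, mu, size(X, 2); maxoutdim=2)
```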

Probabilistic Principal Component Analysis¶

Probabilistic Principal Component Analysis (PPCA) represents a constrained form of the Gaussian distribution in which the number of free parameters can be restricted while still allowing the model to capture the dominant correlations in a data set. It is expressed as the maximum likelihood solution of a probabilistic latent variable model [BSHP06].

This package defines a PPCA type to represent a probabilistic PCA model, and provides a set of methods to access the properties.

Properties¶

Let M be an instance of PPCA, d be the dimension of observations, and p be the output dimension (i.e. the dimension of the principal subspace).

indim(M)

Get the input dimension d, i.e the dimension of the observation space.

outdim(M)

Get the output dimension p, i.e the dimension of the principal subspace.

mean(M)

Get the mean vector (of length d).

projection(M)

Get the projection matrix (of size (d, p)). Each column of the projection matrix corresponds to a principal component.

The principal components are arranged in descending order of the corresponding variances.

var(M)

The total residual variance.

Transformation and Construction¶

Given a probabilistic PCA model M, one can use it to transform observations into latent variables, as

\[ \mathbf{z} = (\mathbf{W}^T \mathbf{W} + \sigma^2 \mathbf{I})^{-1} \mathbf{W}^T (\mathbf{x} - \boldsymbol{\mu}) \]

or use it to reconstruct (approximately) the observations from latent variables, as

\[ \tilde{\mathbf{x}} = \mathbf{W} \mathbf{z} + \boldsymbol{\mu} \]

Here, \(\mathbf{W}\) is the factor loadings (weight) matrix.

The package provides methods to do so:

transform(M, x)

Transform observations x into latent variables.

Here, x can be either a vector of length d or a matrix where each column is an observation.

reconstruct(M, z)

Approximately reconstruct observations from the latent variable given in z.

Here, z can be either a vector of length p or a matrix where each column gives the latent variables for an observation.

Data Analysis¶

One can use the fit method to perform probabilistic PCA over a given dataset.

fit(PPCA, X; ...)

Perform probabilistic PCA over the data given in a matrix X. Each column of X is an observation.

This method returns an instance of PPCA.

Keyword arguments:

Let (d, n) = size(X) be respectively the input dimension and the number of observations:

• method (default: :ml): The choice of methods:
  • :ml: use the maximum likelihood version of probabilistic PCA
  • :em: use the EM version of probabilistic PCA
  • :bayes: use Bayesian PCA
• maxoutdim (default: d-1): Maximum output dimension.
• mean (default: nothing): The mean vector, which can be either of:
  • 0: the input data has already been centralized
  • nothing: this function will compute the mean
  • a pre-computed mean vector
• tol (default: 1.0e-6): Convergence tolerance.
• tot (default: 1000): Maximum number of iterations.

Notes:

• This function calls ppcaml, ppcaem or bayespca internally, depending on the choice of method.

Example:

using MultivariateStats

# suppose Xtr and Xte are training and testing data matrix,
# with each observation in a column

# train a PPCA model
M = fit(PPCA, Xtr; maxoutdim=100)

# apply PPCA model to testing set
Yte = transform(M, Xte)

# reconstruct testing observations (approximately)
Xr = reconstruct(M, Yte)

Core Algorithms¶

Three algorithms are implemented in this package: ppcaml, ppcaem, and bayespca.

ppcaml(Z, mean, tw; ...)

Compute probabilistic PCA based on the maximum likelihood formulation for a centralized sample matrix Z.

Parameters:

• Z – provides centralized samples.
• mean – The mean vector of the original samples, which can be a vector of length d, or an empty vector Float64[] indicating a zero mean.

Returns: The resultant PPCA model.

This function accepts two keyword arguments: maxoutdim and tol.
ppcaem(S, mean, n; ...)

Compute probabilistic PCA based on the expectation-maximization algorithm for a given sample covariance matrix S.

Parameters:

• S – The sample covariance matrix.
• mean – The mean vector of the original samples, which can be a vector of length d, or an empty vector Float64[] indicating a zero mean.
• n – The number of observations.

Returns: The resultant PPCA model.

This function accepts three keyword arguments: maxoutdim, tol, and tot.
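A minimal sketch of ppcaem on a sample covariance; S, the mean, and n come from synthetic data via the Statistics standard library:

```julia
using MultivariateStats, Statistics

X = randn(4, 500)
mu = vec(mean(X, dims=2))
S = cov(X, dims=2)
n = size(X, 2)

M = ppcaem(S, mu, n; maxoutdim=2)
```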
bayespca(S, mean, n; ...)

Compute probabilistic PCA based on the Bayesian algorithm for a given sample covariance matrix S.

Parameters:

• S – The sample covariance matrix.
• mean – The mean vector of the original samples, which can be a vector of length d, or an empty vector Float64[] indicating a zero mean.
• n – The number of observations.

Returns: The resultant PPCA model.

This function accepts three keyword arguments: maxoutdim, tol, and tot.