Linear Manifold Clustering (LMCLUS)¶
Many clustering algorithms are based on the concept that a cluster has a a single center point. Clusters could be considered as groups of points compact around a linear manifold. A linear manifold of dimension 0 is a point. So clustering around a center point is a special case of linear manifold clustering.
Linear manifold clustering algorithm identifies subsets of the data which are embedded in arbitrary oriented lower dimensional linear manifolds, not nessesaraly zero dimensional. Minimal subsets of points are repeatedly sampled to construct trial a linear manifold and isolete points around it based of the proximity of points to the found manifold. Using top-down aproach, the linear manifold clustering alogorithm iterativly partitions dataset and discovers clusters embedded into low-dimensioanl linear subspaces [1].
LMCLUS.jl is a Julia package for linear manifold clustering.
Contents:
Linear Manifold Clustering¶
Linear manifold clustering algorithm (LMCLUS) discovers clusters which are described by a following model:
where is a dimension of the dataset,
is dimension of the manifold,
is a linear manifold translation vector,
is a matrix whose columns are orthonormal vectors that span
,
is a matrix whose columns span subspace orthogonal to spanned by columns of
,
is a zero-mean random vector whose entries are i.i.d. from a support of linear manifold,
is a zero-mean random vector with small variance independent of
.
Clustering¶
This package implements the LMCLUS algorithm in the lmclus
function:
-
lmclus
(X, p)¶ Performs linear manifold clustering over the given dataset.
Parameters: - X – The given sample matrix. Each column of
X
is a sample. - p – The clustering parameters as instance of LMCLUSParameters.
This function returns an
LMCLUSResult
instance.- X – The given sample matrix. Each column of
Results¶
Let M
be an instance of Manifold
, n
be the number of observations, and d
be the dimension of the linear manifold cluster.
-
indim
(M)¶ Returns a dimension of the observation space.
-
outdim
(M)¶ Returns a dimension of the linear manifold cluster which is the dimension of the subspace.
-
size
(M)¶ Returns the number of points in the cluster which is the size of the cluster.
-
points
(M)¶ Returns indexes of points assigned to the cluster.
-
mean
(M)¶ Returns the translation vector
which contains coordinates of the linear manifold origin.
-
projection
(M)¶ Returns the basis matrix with columns corresponding to orthonormal vectors that span the linear manifold.”
-
separation
(M)¶ Returns the instance of Separation object.
Example¶
using LMCLUS
# Load test data, remove label column and flip
X = readdlm(Pkg.dir("LMCLUS", "test", "testData"), ',')[:,1:end-1]'
# Initialize clustering parameters with
# maximum dimensionality for clusters.
# I should be less then original space dimension.
params = LMCLUSParameters(5)
# perform clustering and returns a collection of clusters
clust = lmclus(X, params)
# pick the first cluster
M = manifold(clust, 1)
# obtain indexes of points assigned to the cluster
l = points(M)
# obtain the linear manifold cluster translation vector
mu = mean(M)
# get basis vectors that span manifold as columns of the returned matrix
B = projection(M)
# get separation properties
S = separation(M)
Parameters¶
The clustering properties set in LMCLUSParameters
instance, which is defined as follows:
type LMCLUSParameters min_dim::Int # Minimum cluster dimension max_dim::Int # Maximum cluster dimension number_of_clusters::Int # Nominal number of resulting clusters hist_bin_size::Int # Fixed number of bins for the distance histogram. min_cluster_size::Int # Minimum cluster size best_bound::Float64 # Best bound error_bound::Float64 # Error bound max_bin_portion::Float64 # Maximum histogram bin size random_seed::Int64 # Random seed sampling_heuristic::Int # Sampling heuristic sampling_factor::Float64 # Sampling factor histogram_sampling::Bool # Sample points for distance histogram zero_d_search::Bool # Enable zero-dimensional manifold search basis_alignment::Bool # Manifold cluster basis alignment dim_adjustment::Bool # Manifold dimensionality adjustment dim_adjustment_ratio::Float64 # Ratio of manifold principal subspace variance mdl::Bool # Enable MDL heuristic mdl_model_precision::Int # MDL model precision encoding constant mdl_data_precision::Int # MDL data precision encoding constant mdl_quant_error::Float64 # Quantization error of a bin size calculation mdl_compres_ratio::Float64 # Cluster compression ration log_level::Int # Log level (0-5) end
Here is a description of algorithm parameters and their default values:
name | description | default |
---|---|---|
min_dim | Low bound of a cluster manifold dimension. | 1 |
max_dim | High bound of a cluster manifold dimension. It cannot be larger then a dimensionality of a dataset. | |
number_of_clusters | Expected number of clusters. Requred for the sampling heuristics. | 10 |
hist_bin_size | Number of bins for a distance histogram.
If this parameter is set to zero, the number of bins in
the distance histogram determined by parameter
max_bin_portion . |
0 |
min_cluster_size | Minimum size of a collection of data points to be considered as a proper cluster. | 20 |
best_bound | Separation best bound value is used for evaluating a goodness of separation characterized by a discriminability and a depth between modes of a distance histogram. | 1.0 |
error_bound | Sampling error bound determines a minimal number of samples required to correctly identify a linear manifold cluster. | 1e-4 |
max_bin_portion | Sampling error bound determines a minimal number of samples required to correctly identify a linear manifold cluster. Value should be selected from a (0,1) range. | 0.1 |
random_seed | Random number generator seed. If not specified then RNG will be reinitialized at every run. | 0 |
sampling_heuristic | The choice of heuristic method:
|
3 |
sampling_factor | Sampling factor used in the sampling heuristics (see above, options 2 & 3) to determine a number of samples as a percentage from a total dataset size. | 0.01 |
histogram_sampling | Turns on a sampling for a distance histogram.
Instead of computing the distance histogram from
a whole dataset, the algorithm draws a small sample for
the histogram construction, thus improving a its performance.
This parameter depends on a cluster_number value. |
false |
zero_d_search | Turn on/off zero dimensional manifold search. | false |
basis_alignment | Turn of/off an alignment of a manifold cluster basis. *If it’s on, a manifold basis of the generated cluster is aligned along the direction of the maximum variance (by performing PCA). | false |
dim_adjustment | Turn on/off a linear manifold cluster dimensionality detection by looking for portion of a variance associated with principal components. | false |
dim_adjustment_ratio | Ratio of manifold principal subspace variance. | 0.99 |
mdl | Turn on/off minimum description length heuristic for a complexity validation of a generated cluster. | false |
mdl_model_precision | MDL model precision encoding value. | 32 |
mdl_data_precision | MDL data precision encoding value. | 16 |
mdl_quant_error | Quantization error of a bin size calculation for a histogram which used in determining entropy value of the empirical distance distribution. | 1e-4 |
mdl_compres_ratio | Compression threshold value for discarding candidate clusters. | 1.05 |
log_level | Logging level (ranges from 0 to 5). | 0 |
Suggestions¶
Particular settings could impact performance of the algorithm:
- If you want a persistent clustering results fix a
random_seed
parameter. By default, RNG is reinitialized every time when algorithm runs. - If a dimensionality of the data is low, a histogram sampling could speeding up calculations.
- Value
1
ofsampling_heuristic
parameter should not be used if parametermax_dim
is large, as it will generate a very large number of samples. - Increasing value of
max_bin_portion
parameter could improve an efficiency of the clustering partitioning, but as well could degrade overall performance of the algorithm.
Parallelization¶
This implementation of LMCLUS algorithm uses parallel computations during a manifold sampling stage. You need add additional workers before executing the algorithm.
Separations¶
When linear manifold is formed, a distance from every point of dataset to the manifold is calculated, and a histograms of point distances to each trial manifold are computed. If the resulting histogram contains multiple modes then the mode near zero is isolated in histogram [1]. The isolated part of histogram is used to determine a separation criteria, and the data points are partitioned from the rest of the dataset on the basis of such separation.
The separation properties defined in Separation
instance, which is defined as follows:
type Separation depth::Float64 # Separation depth (depth between separated histogram modes) discriminability::Float64 # Separation discriminability (width between separated histogram modes) threshold::Float64 # Distance threshold value globalmin::Int # Global minimum as histogram bin index hist_range::Vector{Float64} # Histogram ranges hist_count::Vector{UInt32} # Histogram counts bin_index::Vector{UInt32} # Point to bin assignments end
Separation criteria and distance threshold value can be accessed through following functions:
-
criteria
(S)¶ Returns separation criteria value which is product of depth and discriminability.
-
threshold
(S)¶ Returns distance threshold value for separation calculated on histogram of distances. It is used to determine which points belong to formed cluster.
References
[1] |
|
Utilities¶
Linear Manifold Clustering Algorithm relies on multiple search and optimization methods:
-
mdl
(M, X; Pm = 32, Pd = 16, T = :Empirical, ɛ = 1e-4)¶ Performs calculation of the minimum description length for the linear manifold cluster.
Parameters: - M – Linear manifold cluster description as
Manifold
type instance. - X – Linear manifold cluster data as
Matrix
with points as its columns. - Pm – Precision encoding constant for the model, i.e. number of bits
required for encoding on element of the model description.
Default value is 32 which corresponds to
Float32
. - Pd – Precision encoding constant for the data.
- T – Type of a dataset encoding model as symbol:
:Gausian
,:Uniform
,:Empirical
. - ɛ – Error tolerance for bin quantization used in an empirical model encoding
Returns number of bits required to encode linear manifold cluster with the MDL schema.
- M – Linear manifold cluster description as
References
[1] |
|
[2] |
|
Notes:
All methods implemented in this package adopt the column-major convention: in a data matrix, each column corresponds to a sample/observation, while each row corresponds to a feature (variable or attribute).
References
[1] | Haralick, R. & Harpaz, R. “Linear manifold clustering in high dimensional spaces by stochastic search”, Pattern recognition, Elsevier, 2007, 40, 2672-2684, DOI:10.1016/j.patcog.2007.01.020 |