Welcome to Clustering’s documentation!¶
Clustering.jl is a Julia package for data clustering. The package provides a variety of clustering algorithms, as well as utilities for initialization and result evaluation.
Overview¶
Clustering.jl provides functionality for three aspects of data clustering:
- Clustering initialization, e.g. K-means++ seeding.
- Clustering algorithms, e.g. K-means, K-medoids, affinity propagation, and DBSCAN.
- Clustering evaluation, e.g. silhouettes and variation of information.
Inputs¶
A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:
- Sample matrix X, where each column X[:,i] corresponds to an observed sample.
- Distance matrix D, where D[i,j] indicates the distance between samples i and j, or the cost of assigning one to the other.
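For instance, a distance matrix can be derived from a sample matrix. A minimal sketch, assuming the Distances.jl package (not part of Clustering.jl; the dims keyword follows recent versions of that package):

using Distances

X = rand(5, 100)   # sample matrix: 5-dimensional points, one per column

# D[i,j] is the Euclidean distance between samples X[:,i] and X[:,j]
D = pairwise(Euclidean(), X, dims=2)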
Common Options¶
Many clustering algorithms are iterative procedures. These options are usually provided to the user to control the behavior of the algorithm.

name      description
maxiter   Maximum number of iterations.
tol       Tolerable change of objective at convergence.
          The algorithm is considered to have converged when the change of the
          objective value between consecutive iterations is below the specified value.
display   The level of information to be displayed.
          This should be a symbol, which may take either of the following values:
          - :none – nothing is shown
          - :final – only show a brief summary when the algorithm ends
          - :iter – show progress at each iteration
Results¶
A clustering algorithm would return a struct that captures both the clustering results (e.g. assignments of samples to clusters) and information about the clustering procedure (e.g. the number of iterations or whether the iterative update converged).
Generally, the resultant struct is defined as an instance of a sub-type of ClusteringResult. The following generic methods are implemented for these sub-types (let R be an instance):

nclusters(R)
    Get the number of clusters.

assignments(R)
    Get a vector of assignments.

    Let a = assignments(R), then a[i] is the index of the cluster to which the i-th sample is assigned.

counts(R)
    Get sample counts of clusters.

    Let c = counts(R), then c[k] is the number of samples assigned to the k-th cluster.
Clustering Initialization¶
A clustering algorithm usually relies on an initialization scheme to bootstrap the clustering procedure.
Seeding¶
Seeding is an important family of methods for clustering initialization, which generally refers to a procedure that selects a few seeds from a data set, each serving as the initial center of a cluster.
Seeding functions¶
The package provides two functions, initseeds and initseeds_by_costs, for seeding:

initseeds(algname, X, k)
    Select k seeds from a given sample matrix X.

    It returns an integer vector of length k that contains the indexes of the chosen seeds.

    Here, algname indicates the seeding algorithm, which should be a symbol that may take either of the following values:

    algname   description
    :rand     Randomly select a subset as seeds.
    :kmpp     K-means++ algorithm, i.e. choose seeds sequentially, where the
              probability of a sample being chosen is proportional to the minimum
              cost of assigning it to the existing seeds.

              Reference:
              D. Arthur and S. Vassilvitskii (2007). K-means++: the Advantages of
              Careful Seeding. 18th Annual ACM-SIAM Symposium on Discrete
              Algorithms, 2007.

    :kmcen    Choose the k samples with the highest centrality as seeds.

initseeds_by_costs(algname, C, k)
    Select k seeds based on a cost matrix C.

    Here, C[i,j] is the cost of binding samples i and j to the same cluster. One may, for example, use the squared Euclidean distance between samples as the costs.

    The argument algname determines the choice of algorithm (see above).
In practice, we have found K-means++ to be the most effective method for initial seeding. Thus, we provide specific functions to simplify the use of K-means++ seeding:
kmpp(X, k)
    Use K-means++ to choose k seeds from a data set given by a sample matrix X.

kmpp_by_costs(C, k)
    Use K-means++ to choose k seeds based on a cost matrix C.
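For example, one may pick K-means++ seeds from a random dataset as follows (a minimal sketch; the returned indices refer to columns of X):

using Clustering

X = rand(5, 100)               # 100 samples, one per column

# select 8 seeds with K-means++; returns a vector of 8 column indices
iseeds = initseeds(:kmpp, X, 8)

# the kmpp shortcut is equivalent
iseeds2 = kmpp(X, 8)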
Internals¶
In this package, each seeding algorithm is represented by a sub-type of SeedingAlgorithm. In particular, the random selection algorithm, K-means++, and the centrality-based algorithm are respectively represented by the sub-types RandSeedAlg, KmppAlg, and KmCentralityAlg.

For each sub-type, the following methods are implemented:
initseeds!(iseeds, alg, X)
    Select seeds from a given sample matrix X, and write the results to iseeds.

    Parameters:
    - iseeds – A pre-allocated array to store the indexes of the chosen seeds.
    - alg – The algorithm instance.
    - X – The given data matrix. Each column of X is a sample.

    Returns: iseeds

initseeds_by_costs!(iseeds, alg, C)
    Select seeds based on a given cost matrix C, and write the results to iseeds.

    Parameters:
    - iseeds – A pre-allocated array to store the indexes of the chosen seeds.
    - alg – The algorithm instance.
    - C – The cost matrix. The value of C[i,j] is the cost of binding samples i and j into the same cluster.

    Returns: iseeds

Note: For both functions above, the length of iseeds determines the number of seeds to be selected.
To define a new seeding algorithm, one has to first define a sub-type of SeedingAlgorithm and implement the two functions above, as in the sketch below.
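As an illustration, here is a minimal sketch of a custom seeding algorithm, written in current Julia syntax. FirstKAlg is hypothetical (not part of the package); it simply uses the first k samples as seeds and extends initseeds! as described above (a full implementation would also extend initseeds_by_costs!):

using Clustering
import Clustering: initseeds!

struct FirstKAlg <: Clustering.SeedingAlgorithm end   # hypothetical algorithm

function initseeds!(iseeds::AbstractVector{Int}, alg::FirstKAlg, X::AbstractMatrix)
    # the length of iseeds determines the number of seeds (see the note above)
    for i in 1:length(iseeds)
        iseeds[i] = i          # use the i-th sample as the i-th seed
    end
    return iseeds
end

X = rand(5, 100)
iseeds = Vector{Int}(undef, 3)       # pre-allocate the index vector
initseeds!(iseeds, FirstKAlg(), X)   # fills iseeds with [1, 2, 3]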
Clustering Algorithms¶
This package implements a variety of clustering algorithms:
K-means¶
K-means is a classic method for clustering or vector quantization. The K-means algorithm produces a fixed number of clusters, each associated with a center (also known as a prototype), and each sample belongs to the cluster with the nearest center.
From a mathematical standpoint, K-means is a coordinate descent algorithm that solves the following optimization problem:

    \text{minimize} \ \sum_{i=1}^n \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \ \text{w.r.t.} \ (\boldsymbol{\mu}, z)

Here, \boldsymbol{\mu}_k is the center of the k-th cluster, and z_i is the index of the cluster to which \mathbf{x}_i is assigned.
This package implements the K-means algorithm in the kmeans
function:
kmeans(X, k; ...)
    Performs K-means clustering over the given dataset.

    Parameters:
    - X – The given sample matrix. Each column of X is a sample.
    - k – The number of clusters.

    This function returns an instance of KmeansResult, which is defined as follows:

    type KmeansResult{T<:FloatingPoint} <: ClusteringResult
        centers::Matrix{T}         # cluster centers, size (d, k)
        assignments::Vector{Int}   # assignments, length n
        costs::Vector{T}           # costs of the resultant assignments, length n
        counts::Vector{Int}        # number of samples assigned to each cluster, length k
        cweights::Vector{Float64}  # cluster weights, length k
        totalcost::Float64         # total cost (i.e. objective)
        iterations::Int            # number of elapsed iterations
        converged::Bool            # whether the procedure converged
    end
One may optionally specify some of the options through keyword arguments to control the algorithm:
    name      description                                                default
    init      Initialization algorithm or initial seeds, which can be    :kmpp
              either of the following:
              - a symbol indicating the name of a seeding algorithm,
                :rand, :kmpp, or :kmcen (see Clustering Initialization)
              - an integer vector of length k that provides the indexes
                of the initial seeds
    maxiter   Maximum number of iterations.                              100
    tol       Tolerable change of objective at convergence.              1.0e-6
    weights   The weights of samples, which can be either of:            nothing
              - nothing: each sample has a unit weight
              - a vector of length n that gives the sample weights
    display   The level of information to be displayed.                  :none
              (see Common Options)
If you already have a set of initial center vectors, you may use kmeans! instead:

kmeans!(X, centers; ...)
    Performs K-means given initial centers, and updates the centers in place.

    Parameters:
    - X – The given sample matrix. Each column of X is a sample.
    - centers – The matrix of centers. Each column of centers is the center vector of a cluster.

    Note: The number of clusters k is determined as size(centers, 2).

    Like kmeans, this function returns an instance of KmeansResult.

    This function accepts all keyword arguments listed above for kmeans (except init).
Examples:
using Clustering
# make a random dataset with 1000 points
# each point is a 5-dimensional vector
X = rand(5, 1000)
# performs K-means over X, trying to group them into 20 clusters
# set maximum number of iterations to 200
# set display to :iter, so it shows progressive info at each iteration
R = kmeans(X, 20; maxiter=200, display=:iter)
# the number of resultant clusters should be 20
@assert nclusters(R) == 20
# obtain the resultant assignments
# a[i] indicates which cluster the i-th sample is assigned to
a = assignments(R)
# obtain the number of samples in each cluster
# c[k] is the number of samples assigned to the k-th cluster
c = counts(R)
# get the centers (i.e. mean vectors)
# M is a matrix of size (5, 20)
# M[:,k] is the mean vector of the k-th cluster
M = R.centers
Example with plot
using RDatasets
iris = dataset("datasets", "iris")
head(iris)
# K-means Clustering unsupervised machine learning example
using Clustering
features = permutedims(convert(Array, iris[:,1:4]), [2, 1]) # use matrix() on Julia v0.2
result = kmeans(features, 3)  # group into 3 clusters
using Gadfly
plot(iris, x = "PetalLength", y = "PetalWidth", color = result.assignments, Geom.point)
K-medoids¶
K-medoids is a clustering algorithm that seeks a subset of points out of a given set such that the total cost (or distance) between each point and the closest point in the chosen subset is minimal. The points in this chosen subset are called medoids.

This package implements a K-means style algorithm instead of PAM (Partitioning Around Medoids), as the former is considered to be much more efficient and reliable. Specifically, the algorithm is implemented by the kmedoids function.
kmedoids(C, k; ...)
    Performs K-medoids clustering based on a given cost matrix.

    Parameters:
    - C – The cost matrix, where C[i,j] is the cost of assigning sample j to the medoid i.
    - k – The number of clusters.

    This function returns an instance of KmedoidsResult, which is defined as follows:

    type KmedoidsResult{T} <: ClusteringResult
        medoids::Vector{Int}       # indices of medoids (k)
        assignments::Vector{Int}   # assignments (n)
        acosts::Vector{T}          # costs of the resultant assignments (n)
        counts::Vector{Int}        # number of samples assigned to each cluster (k)
        totalcost::Float64         # total assignment cost (i.e. objective)
        iterations::Int            # number of elapsed iterations
        converged::Bool            # whether the procedure converged
    end

    One may optionally specify some of the options through keyword arguments to control the algorithm:

    name      description                                                default
    init      Initialization algorithm or initial medoids, which can     :kmpp
              be either of the following:
              - a symbol indicating the name of a seeding algorithm,
                :rand, :kmpp, or :kmcen (see Clustering Initialization)
              - an integer vector of length k that provides the indexes
                of the initial seeds
    maxiter   Maximum number of iterations.                              100
    tol       Tolerable change of objective at convergence.              1.0e-6
    display   The level of information to be displayed.                  :none
              (see Common Options)
- C – The cost matrix, where
kmedoids!(C, medoids; ...)
    Performs K-medoids clustering based on a given cost matrix.

    This function operates on a given set of medoids and updates it in place.

    Parameters:
    - C – The cost matrix, where C[i,j] is the cost of assigning sample j to the medoid i.
    - medoids – The vector of medoid indexes. The contents of medoids serve as the initial guess and will be overwritten by the results.

    This function returns an instance of KmedoidsResult.

    One may optionally specify some of the options through keyword arguments to control the algorithm:

    name      description                                     default
    maxiter   Maximum number of iterations.                   100
    tol       Tolerable change of objective at convergence.   1.0e-6
    display   The level of information to be displayed.       :none
              (see Common Options)
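For example, one may build the cost matrix from pairwise squared Euclidean distances and run K-medoids on it. A minimal sketch, assuming the Distances.jl package (not part of Clustering.jl) for building the cost matrix:

using Clustering, Distances

X = rand(5, 200)                        # 200 samples, one per column
C = pairwise(SqEuclidean(), X, dims=2)  # cost matrix: squared Euclidean distances

R = kmedoids(C, 10; maxiter=200, display=:final)

R.medoids           # indices of the 10 chosen medoids
a = assignments(R)  # cluster index for each of the 200 samples
c = counts(R)       # number of samples in each cluster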
Hierarchical Clustering¶
Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.
Functions
hclust(D; linkage=:single)
    Perform hierarchical clustering on distance matrix D with the specified cluster linkage function.

    Parameters:
    - D – The pairwise distance matrix. D[i,j] is the distance between points i and j.
    - linkage – A Symbol specifying how the distance between clusters (aka cluster linkage) is measured. It determines which clusters are merged on each iteration. Valid choices are:
      - :single: use the minimum distance between any of the members
      - :average: use the mean distance between the clusters' members
      - :complete: use the maximum distance between any of the members
      - :ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
      - :ward_presquared: same as :ward, but assumes that the distances in D are already squared

    The function returns an object of type Hclust with the fields:
    - merges – the sequence of subtree merges. Leaves are indicated by negative numbers; the ids of non-trivial subtrees refer to the rows in the merges matrix and the elements of the heights vector.
    - heights – subtree heights, i.e. the distances between the left and right top branches of each subtree.
    - order – indices of points ordered such that there are no intersecting branches on the dendrogram plot. This ordering brings points of the same cluster close together.
    - linkage – the cluster linkage used.
Example:
D = rand(1000, 1000)
D += D'                              # symmetric distance matrix (optional)
result = hclust(D, linkage=:single)
cutree(result; k=nothing, h=nothing)
    Cuts the dendrogram to produce clusters at the specified level of granularity.

    Parameters:
    - result – Object of type Hclust holding the results of a call to hclust().
    - k – Integer specifying the number of desired clusters.
    - h – Real specifying the height at which to cut the tree.

    If both k and h are specified, it is guaranteed that the number of clusters is ≥ k and their height is ≤ h.

    The output is a vector specifying the cluster index for each data point.
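For example, continuing the hclust example above, the dendrogram can be cut either by cluster count or by height (a minimal sketch):

D = rand(100, 100)
D += D'                              # make the distance matrix symmetric
result = hclust(D, linkage=:average)

a = cutree(result; k=5)              # exactly 5 clusters; a[i] is the cluster of point i
b = cutree(result; h=0.3)            # alternatively, cut at height 0.3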
Affinity Propagation¶
Affinity propagation is a clustering algorithm based on message passing between data points. Similar to K-medoids, it finds a subset of points as exemplars based on (dis)similarities, and assigns each point in the given data set to the closest exemplar.
This package implements the affinity propagation algorithm based on the following paper:
Brendan J. Frey and Delbert Dueck. Clustering by Passing Messages Between Data Points. Science, vol 315, pages 972-976, 2007.
The implementation is optimized by reducing unnecessary array allocation and fusing loops. Specifically, the algorithm is implemented by the affinityprop
function:
affinityprop(S; ...)
    Performs affinity propagation based on a similarity matrix S.

    Parameters:
    - S – The similarity matrix. Here, S[i,j] is the similarity (or negated distance) between samples i and j when i != j, while S[i,i] reflects the availability of the i-th sample as an exemplar.

    This function returns an instance of AffinityPropResult, defined as below:

    type AffinityPropResult <: ClusteringResult
        exemplars::Vector{Int}     # indexes of exemplars (centers)
        assignments::Vector{Int}   # assignments for each point
        iterations::Int            # number of iterations executed
        converged::Bool            # converged or not
    end
One may optionally specify the following keyword arguments:
name      description                                          default
maxiter   Maximum number of iterations.                        100
tol       Tolerable change of objective at convergence.        1.0e-6
damp      Dampening coefficient.                               0.5
          The value should be in [0.0, 1.0). A larger value
          of damp indicates slower (and probably more stable)
          updates; damp = 0 means no dampening is performed.
display   The level of information to be displayed.            :none
          (see Common Options)
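A minimal usage sketch: similarities are taken as negated squared Euclidean distances (assuming Distances.jl), and the diagonal is set to the median similarity, a common heuristic for the exemplar preference rather than a package default:

using Clustering, Distances, Statistics

X = rand(5, 300)
S = -pairwise(SqEuclidean(), X, dims=2)   # similarities: negated squared distances

p = median(S)                             # heuristic preference value
for i in 1:size(S, 1)
    S[i, i] = p                           # S[i,i] controls availability as an exemplar
end

R = affinityprop(S; maxiter=200, damp=0.9)
R.exemplars                               # indexes of the chosen exemplars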
DBSCAN¶
Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm is proposed by:
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.
Density Reachability
DBSCAN's definition of a cluster is based on the concept of density reachability: a point q is said to be directly density reachable from another point p if the distance between them is below a specified threshold eps and p is surrounded by sufficiently many points. Then, q is considered to be density reachable from p if there exists a sequence of points p = p_1, ..., p_n = q such that p_{i+1} is directly density reachable from p_i.

A cluster, which is a subset of the given set of points, satisfies two properties:
- All points within the cluster are mutually density-connected, meaning that for any two distinct points p and q in the cluster, there exists a point o such that both p and q are density reachable from o.
- If a point is density-connected to a point of the cluster, it is also part of the cluster.
Functions
There are two different implementations of the DBSCAN algorithm, both called via the dbscan function in this package:

1. Using a distance (adjacency) matrix; this implementation is O(N^2) in memory usage. Note that the boundary points are not unique.
dbscan(D, eps, minpts)
    Perform the DBSCAN algorithm based on a given distance matrix.

    Parameters:
    - D – The pairwise distance matrix. D[i,j] is the distance between points i and j.
    - eps – The radius of a neighborhood.
    - minpts – The minimum number of neighboring points (including self) to qualify a point as a density point.

    The algorithm returns an instance of DbscanResult, defined as below:

    type DbscanResult <: ClusteringResult
        seeds::Vector{Int}         # starting points of clusters, size (k,)
        assignments::Vector{Int}   # assignments, size (n,)
        counts::Vector{Int}        # number of points in each cluster, size (k,)
    end
2. Using an adjacency list, which is built on the fly. The performance is much better in terms of both runtime and memory usage. Also, the result is given as DbscanCluster objects that provide the indices of all the core points and boundary points, such that boundary points can be associated with multiple clusters.
dbscan(points, radius; leafsize=20, min_neighbors=1, min_cluster_size=1)
    Perform the DBSCAN algorithm based on a collection of points.

    Parameters:
    - points – matrix of points (column-based)
    - radius – The radius of a neighborhood.
    - leafsize – number of points binned in each leaf node of the KDTree
    - min_neighbors – minimum number of neighbors required for a core point
    - min_cluster_size – minimum number of points required for a valid cluster

    The algorithm returns instances of DbscanCluster, defined as below:

    immutable DbscanCluster <: ClusteringResult
        size::Int                       # number of points in cluster
        core_indices::Vector{Int}       # indices of the core points
        boundary_indices::Vector{Int}   # indices of the boundary points
    end
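A minimal usage sketch of this point-based method (keyword names follow the signature above):

using Clustering

points = rand(3, 10000)        # 3-dimensional points, one per column

clusters = dbscan(points, 0.05; min_neighbors=3, min_cluster_size=20)

# each element is a DbscanCluster; a cluster's members are the union
# of its core and boundary indices
c = clusters[1]
members = vcat(c.core_indices, c.boundary_indices)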
Clustering Validation¶
This package provides a variety of ways to validate or evaluate clustering results:
Silhouettes¶
Silhouettes is a method for validating clusters of data. In particular, it provides a quantitative way to measure how well each item lies within its cluster as opposed to others. The silhouette value of the i-th data point is defined as:

    s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Here, a(i) is the average distance from the i-th point to the other points within the same cluster. Let b(i,k) be the average distance from the i-th point to the points in the k-th cluster; then b(i) is the minimum of b(i,k) over all clusters that the i-th point is not assigned to.

Note that the value of s(i) is not greater than one, and that a value of s(i) close to one indicates that the i-th point lies well within its own cluster.
silhouettes(assignments, counts, dists)
    Compute silhouette values for individual points w.r.t. a given clustering.

    Parameters:
    - assignments – the vector of assignments
    - counts – the number of points falling in each cluster
    - dists – the pairwise distance matrix

    Returns: a vector of silhouette values for the individual points. In practice, one may use the average of these silhouette values to assess a given clustering result.

silhouettes(R, dists)
    This method accepts a clustering result R (of a sub-type of ClusteringResult).

    It is equivalent to silhouettes(assignments(R), counts(R), dists).
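For example, silhouette values for a K-means result can be computed from the pairwise distance matrix. A minimal sketch, assuming Distances.jl for the distance matrix:

using Clustering, Distances, Statistics

X = rand(5, 300)
R = kmeans(X, 4)

D = pairwise(Euclidean(), X, dims=2)   # pairwise distances between the samples

s = silhouettes(R, D)                  # one silhouette value per point
mean(s)                                # average silhouette as an overall score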
Variation of Information¶
Variation of information (also known as shared information distance) is a measure of the distance between two clusterings. It is devised based on mutual information, but it is a true metric, i.e. it satisfies symmetry and triangle inequality.
References:
Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173–187.
This package provides the varinfo
function that implements this metric:
varinfo(k1, a1, k2, a2)
    Compute the variation of information between two assignments.

    Parameters:
    - k1 – The number of clusters in the first clustering.
    - a1 – The assignment vector for the first clustering.
    - k2 – The number of clusters in the second clustering.
    - a2 – The assignment vector for the second clustering.

    Returns: the value of variation of information.

varinfo(R, k0, a0)
    This method takes R, an instance of ClusteringResult, as input, and computes the variation of information between its corresponding clustering and the one given by (k0, a0), where k0 is the number of clusters in the other clustering and a0 is the corresponding assignment vector.

varinfo(R1, R2)
    This method takes R1 and R2 (both instances of ClusteringResult) and computes the variation of information between them.
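For example (a minimal sketch with synthetic data and labels):

using Clustering

X = rand(5, 300)
R1 = kmeans(X, 4)
R2 = kmeans(X, 5)

v12 = varinfo(R1, R2)       # distance between two clustering results

a0 = rand(1:4, 300)         # synthetic "ground truth" with 4 classes
v0 = varinfo(R1, 4, a0)     # compare a result against the labels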
Rand indices¶
Rand index is a measure of the similarity between two data clusterings. From a mathematical standpoint, Rand index is related to the accuracy, but is applicable even when class labels are not used.
References:
Lawrence Hubert and Phipps Arabie (1985). Comparing partitions. Journal of Classification 2 (1): 193–218
Meila, Marina (2003). Comparing Clusterings by the Variation of Information. Learning Theory and Kernel Machines: 173–187.
This package provides the randindex
function that implements several metrics:
randindex(c1, c2)
    Compute the tuple of indices (Adjusted Rand index, Rand index, Mirkin's index, Hubert's index) between two assignments.

    Parameters:
    - c1 – The assignment vector for the first clustering.
    - c2 – The assignment vector for the second clustering.

    Returns: a tuple of indices.

randindex(R, c0)
    This method takes R, an instance of ClusteringResult, as input, and computes the tuple of indices (see above) between its corresponding clustering and the one given by the assignment vector c0.

randindex(R1, R2)
    This method takes R1 and R2 (both instances of ClusteringResult) and computes the tuple of indices (see above) between them.
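For example (a minimal sketch with synthetic assignment vectors):

using Clustering

a1 = rand(1:3, 100)         # first clustering: 3 clusters
a2 = rand(1:4, 100)         # second clustering: 4 clusters

ari, ri, mirkin, hubert = randindex(a1, a2)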
V-measure¶
The V-measure is defined as the harmonic mean of the homogeneity and completeness of the clustering. Both of these measures can be expressed in terms of the mutual information and entropy measures of information theory.

Homogeneity is maximized when each cluster contains elements of as few different classes as possible. Completeness is maximized when all elements of each class are assigned to a single cluster.
References:
Andrew Rosenberg and Julia Hirschberg, 2007. “V-Measure: A conditional entropy-based external cluster evaluation measure”
The metric is implemented by the vmeasure
function:
vmeasure(assign1, assign2; β = 1.0)
    Compute the V-measure value between two clustering assignments.

    Parameters:
    - assign1 – the vector of assignments for the first clustering.
    - assign2 – the vector of assignments for the second clustering.
    - β – the weight of the harmonic mean of homogeneity and completeness.

    Returns: a V-measure value.

vmeasure(R, assign)
    This method takes R, an instance of ClusteringResult, and the corresponding assignment vector assign as input, and computes the V-measure value (see above).

vmeasure(R1, R2)
    This method takes R1 and R2 (both instances of ClusteringResult) and computes the V-measure value (see above). It is equivalent to vmeasure(assignments(R1), assignments(R2)).
.