Welcome to FSFC’s documentation!

Base Classes and Mixins

FSFC uses some base classes and mixins based on sklearn estimators and classifiers. This makes the library fully compatible with sklearn, so feature selectors can be used in pipelines.
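
For example, a selector can be placed before a clustering step in a pipeline. The snippet below is a minimal sketch; the dataset and the parameter values are arbitrary:

    # Minimal sketch: an FSFC selector as a pipeline step before clustering.
    # The dataset and parameter values are illustrative only.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.cluster import KMeans
    from fsfc.generic import GenericSPEC

    x = np.random.rand(100, 50)              # toy data: 100 samples, 50 features

    pipeline = Pipeline([
        ('select', GenericSPEC(k=10)),        # keep the 10 best-scoring features
        ('cluster', KMeans(n_clusters=3)),
    ])
    labels = pipeline.fit_predict(x)          # cluster labels computed on the reduced data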

Mixins

Default mixins that extend the ones defined in sklearn.

class fsfc.mixins.KBestSelectorMixin[source]

Bases: fsfc.mixins.ScoreSelectorMixin

Mixin that selects the K best features according to their scores.

class fsfc.mixins.ScoreSelectorMixin[source]

Bases: object

Mixin that adds a getter for feature scores and checks that the scores have been calculated.

class fsfc.mixins.ThresholdSelectorMixin[source]

Bases: fsfc.mixins.ScoreSelectorMixin

Mixin that selects features according to a threshold: every feature whose score is higher than the threshold is selected.

Classes

Base classes for all feature selection / clustering algorithms in FSFC.

class fsfc.base.BaseFeatureSelector[source]

Bases: sklearn.base.BaseEstimator, sklearn.feature_selection.base.SelectorMixin

Base class for all feature selection algorithms. It is sklearn-compliant, so it can be used in pipelines.

Subclasses should override the fit() and _get_support_mask() methods.
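
A hypothetical subclass could look like the sketch below; the variance-based scoring is purely illustrative and is not part of FSFC:

    # Hypothetical subclass sketch; the variance-based scoring is illustrative only.
    import numpy as np
    from fsfc.base import BaseFeatureSelector

    class TopVarianceSelector(BaseFeatureSelector):
        def __init__(self, k):
            self.k = k

        def fit(self, x, *rest):
            scores = np.var(x, axis=0)              # score every feature by its variance
            top = np.argsort(scores)[-self.k:]      # indices of the k highest-variance features
            self.mask_ = np.zeros(x.shape[1], dtype=bool)
            self.mask_[top] = True
            return self                             # return self to support chaining

        def _get_support_mask(self):
            return self.mask_                       # boolean mask consumed by transform()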

Methods

fit(x, *rest) Fit selector to a dataset.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.
fit(x, *rest)[source]

Fit selector to a dataset.

Parameters:
x: ndarray

The input samples - array of shape [n_samples, n_features].

rest: list

List of miscellaneous arguments.

Returns:
selector: BaseFeatureSelector

Returns self to support chaining.

class fsfc.base.ClusteringFeatureSelector(k)[source]

Bases: fsfc.mixins.KBestSelectorMixin, fsfc.base.BaseFeatureSelector, sklearn.base.ClusterMixin

Clusters samples and simultaneously finds relevant features. Allows transforming the dataset and selecting the K best features according to feature scores.

Parameters:
k: int

Number of features to select

Methods

fit(x, *rest)
fit_predict(X[, y]) Performs clustering on X and returns cluster labels.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.
fit(x, *rest)[source]
class fsfc.base.KBestFeatureSelector(k)[source]

Bases: fsfc.mixins.KBestSelectorMixin, fsfc.base.BaseFeatureSelector

Base class for algorithms that select the K best features according to feature scores.

Subclasses should override the _calc_scores() method to compute feature scores.

Parameters:
k: int

Number of features to select
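
A sketch of a possible subclass is shown below. It assumes that _calc_scores() receives the data matrix and returns one score per feature (check the FSFC source for the exact signature); the scoring function itself is hypothetical:

    # Hypothetical subclass sketch; assumes _calc_scores(x) takes the data matrix
    # and returns one score per feature -- verify against the FSFC source.
    import numpy as np
    from fsfc.base import KBestFeatureSelector

    class MeanDeviationSelector(KBestFeatureSelector):
        def _calc_scores(self, x):
            # Illustrative score: mean absolute deviation of each feature
            return np.mean(np.abs(x - x.mean(axis=0)), axis=0)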

Methods

fit(x, *rest)
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.
fit(x, *rest)[source]

Methods for generic data

FSFC contains implementations of several feature selection algorithms that work with generic data.

Every algorithm can be imported either from its own package or from the fsfc.generic module.

SPEC algorithm family

A family of algorithms based on Spectral Graph Theory.

class fsfc.generic.SPEC.FixedSPEC(k, clusters)[source]

Bases: fsfc.generic.SPEC.SPEC

Feature selection algorithm that represents samples as vertices of a graph whose edge weights are the RBF distances between points.

The algorithm uses Spectral Graph Theory to find the features with the best separability, under the assumption that the points are separated into a predefined number of clusters.

To do this, the algorithm finds the eigenvectors corresponding to the K smallest eigenvalues of the graph Laplacian, excluding the trivial one, and uses the cosine distance between them and the feature vectors to detect the most explanatory features. The algorithm selects k features using this score.

Parameters:
k: int

Number of features to select

clusters: int

Expected number of clusters
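
Illustrative usage (the data and parameter values are arbitrary):

    # Illustrative usage of FixedSPEC; data and parameter values are arbitrary.
    import numpy as np
    from fsfc.generic import FixedSPEC

    x = np.random.rand(100, 50)                     # toy data: 100 samples, 50 features
    selector = FixedSPEC(k=10, clusters=3)
    x_reduced = selector.fit_transform(x)           # keeps the 10 highest-scoring features
    selected = selector.get_support(indices=True)   # indices of the selected features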

Methods

fit(x, *rest)
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.
class fsfc.generic.SPEC.GenericSPEC(k)[source]

Bases: fsfc.generic.SPEC.NormalizedCut

Feature selection algorithm that represents samples as vertices of a graph whose edge weights are the RBF distances between points.

The algorithm uses Spectral Graph Theory to find the features with the best separability.

To do this, the algorithm finds the trivial eigenvector of the graph Laplacian and uses it to normalise the scores computed by NormalizedCut. According to the article the SPEC family is based on, this normalisation improves the accuracy of feature selection.

The algorithm selects the top k features according to these scores.

Parameters:
k: int

Number of features to select

Methods

fit(x, *rest)
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.
class fsfc.generic.SPEC.NormalizedCut(k)[source]

Bases: fsfc.generic.SPEC.SPEC

Feature selection algorithm that represents samples as vertices of a graph whose edge weights are the RBF distances between points. The algorithm attempts to find a minimal cut in this graph.

It selects the k features separated by this cut, as they are considered to explain the dataset best.

Parameters:
k: int

Number of features to select

Methods

fit(x, *rest)
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.
class fsfc.generic.SPEC.SPEC(k)[source]

Bases: fsfc.base.KBestFeatureSelector

Base class of the SPEC family of feature selection algorithms.

Every algorithm in this family builds a graph representation of the distribution of samples in a high-dimensional space: samples become vertices, and the RBF distance between two samples becomes the weight of the edge between them. The algorithms use Spectral Graph Theory to calculate feature scores.

Based on the article “Spectral feature selection for supervised and unsupervised learning.”.
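
The sketch below illustrates the graph construction the family relies on (an RBF affinity matrix and the graph Laplacian); it is a conceptual illustration, not FSFC's internal code:

    # Conceptual illustration of the graph the SPEC family works with;
    # this is not FSFC's internal implementation.
    import numpy as np
    from scipy.spatial.distance import cdist

    def rbf_graph_laplacian(x, gamma=1.0):
        sq_dists = cdist(x, x, 'sqeuclidean')   # pairwise squared distances between samples
        w = np.exp(-gamma * sq_dists)           # RBF edge weights
        d = np.diag(w.sum(axis=1))              # degree matrix
        return d - w                            # unnormalised graph Laplacian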

Methods

fit(x, *rest)
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.

Lasso algorithm

class fsfc.generic.Lasso.Lasso(k, norm_constraint, eps=0.0001, max_iterations=100)[source]

Bases: fsfc.base.ClusteringFeatureSelector

Feature selection and clustering algorithm exploiting an L1-norm constraint.

It simultaneously clusters the data and computes the “importance” of each feature for the clustering.

Based on the article “A framework for feature selection in clustering.”

The algorithm clusters the data and selects features in the following way:
  1. Assigns equal weights to every feature.
  2. Finds clusters according to the feature weights.
  3. Computes the objective of the method.
  4. Optimizes the L1 penalty for the found objective.
  5. Computes new feature weights using the objective.
  6. If the new weights equal the old ones, stops. Otherwise repeats steps 2-6.
  7. Selects the top k features according to their weights.
Parameters:
k: int

Number of features to select

norm_constraint: float

Constraint on the L1-norm.

eps: float (default 1e-4)

Precision of the algorithm

max_iterations: int (default 100)

Maximal number of iterations of the algorithm.
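
Illustrative usage (the data and parameter values are arbitrary):

    # Illustrative usage of the Lasso selector; data and parameter values are arbitrary.
    import numpy as np
    from fsfc.generic import Lasso

    x = np.random.rand(100, 20)          # toy data: 100 samples, 20 features
    selector = Lasso(k=5, norm_constraint=1.5)
    labels = selector.fit_predict(x)     # cluster labels found during fitting
    x_reduced = selector.transform(x)    # dataset reduced to the 5 selected features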

Methods

fit(x, *rest)
fit_predict(X[, y]) Performs clustering on X and returns cluster labels.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.

LFSBSS algorithm

class fsfc.generic.LFSBSS.LFSBSS(clusters, max_iterations=100)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.ClusterMixin

Localised Feature Selection Based on Scatter Separability.

Selects features and simultaneously builds a clustering in an iterative way. Every cluster has its own local set of selected features, and input data is projected onto the subspace defined by that set to predict the cluster of a point.

For the sake of performance, this implementation does not take into account the importance of cluster overlap and unassigned points.

Based on the article “Localized feature selection for clustering.”.

The algorithm builds the clustering in the following way:
  1. Find an initial clustering using the k-Means algorithm.
  2. For each cluster:
    1. Find a feature that can be dropped to improve the Scatter Separability Score.
    2. Recompute the clusters without this feature.
    3. Find the cluster most similar to the current one.
    4. Compute the normalized Scatter Separability Score for the two clusters - the current one and the new one.
    5. If the score improved, drop the feature and update the clustering.
  3. Repeat step 2 until no changes are made.
  4. Return the found clusters.
The algorithm predicts the cluster of a new point in the following way:
  1. Project the point onto the feature subspace of each cluster.
  2. Find the cluster whose center is closest to the projected point.
Parameters:
clusters: int

Number of clusters to find

max_iterations: int (default 100)

Maximal number of iterations of the algorithm
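
Illustrative usage (the data and parameter values are arbitrary):

    # Illustrative usage of LFSBSS; data and parameter values are arbitrary.
    import numpy as np
    from fsfc.generic import LFSBSS

    x = np.random.rand(100, 20)      # toy data: 100 samples, 20 features
    model = LFSBSS(clusters=3)
    labels = model.fit_predict(x)    # cluster labels for the training samples
    label = model.predict(x[0])      # predicted cluster for a single sample (per the docstring)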

Methods

fit(x) Fit algorithm to dataset, find clusters and set of features for every cluster
fit_predict(X[, y]) Performs clustering on X and returns cluster labels.
get_params([deep]) Get parameters for this estimator.
predict(x) Predict the cluster for one sample
set_params(**params) Set the parameters of this estimator.
fit(x)[source]

Fit algorithm to dataset, find clusters and set of features for every cluster

Parameters:
x: ndarray

The dataset

Returns:
self: LFSBSS

Returns itself to support chaining

predict(x)[source]

Predict the cluster for one sample

Parameters:
x: ndarray

Sample to predict

Returns:
label: int

Predicted cluster

MCFS algorithm

class fsfc.generic.MCFS.MCFS(k, clusters, p=8, sigma=1, mode='default', alpha=0.01)[source]

Bases: fsfc.base.KBestFeatureSelector

Multi-Cluster Feature Selection algorithm.

Uses a k-NN graph of the samples in the dataset and Spectral Graph Theory to find the most explanatory features.

Based on the article “Unsupervised feature selection for multi-cluster data.”.

The algorithm selects features in the following way:

  1. Computes the k-NN graph for the dataset.
  2. Computes the heat matrix for this graph, along with the degree and Laplacian matrices.
  3. Solves the eigenproblem L y = lambda D y and selects the K smallest eigenvalues and the corresponding eigenvectors.
  4. Solves K regression problems, trying to predict every eigenvector from the dataset.
  5. Computes the score of each feature using the found regression coefficients.
  6. Selects the k features with the top scores.
Parameters:
k: int

Number of features to select.

clusters: int

Expected number of clusters.

p: int (default 8)

Number of nearest neighbours for construction of k-NN graph.

sigma: int (default 1)

Coefficient for computation of heat matrix.

mode: ‘default’ or ‘lasso’ (default ‘default’)

Type of penalty for the method: with ‘default’ the algorithm uses no penalty, with ‘lasso’ it uses an L1 penalty.

alpha: float (default 0.01)

Importance of penalty for algorithm with mode=’lasso’.
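
Illustrative usage (the data and parameter values are arbitrary):

    # Illustrative usage of MCFS; data and parameter values are arbitrary.
    import numpy as np
    from fsfc.generic import MCFS

    x = np.random.rand(100, 50)                     # toy data: 100 samples, 50 features
    selector = MCFS(k=20, clusters=5)
    x_reduced = selector.fit_transform(x)           # keep the 20 best-scoring features
    selected = selector.get_support(indices=True)   # indices of the selected features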

Methods

fit(x, *rest)
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.

Weighted K-means

class fsfc.generic.WKMeans.WKMeans(k, beta, eps=0.001, max_iterations=10)[source]

Bases: fsfc.base.ClusteringFeatureSelector

Weighted K-Means algorithm.

Assigns a weight parameter to every feature, runs iterations of K-Means, and adjusts the weights to explain the data better. Selects the K best features according to their weights.

Based on the article “Automated variable weighting in k-means type clustering.”.

The algorithm selects features in the following way:
  1. Randomly assigns feature weights so that their sum equals 1.
  2. Finds clusters using the samples multiplied by the weights raised to the power of beta.
  3. Computes the score of the clustering.
  4. Recomputes the weights using the approach from the article.
  5. Finds a clustering using the new weights. If the score of the new clustering did not change, stops. Otherwise goes to step 2.
  6. Selects the top k features according to their weights.
Parameters:
k: int

Number of features to select.

beta: float

Degree parameter for the feature weights.

eps: float (default 1e-3)

Precision of the algorithm.

max_iterations: int (default 10)

Maximal number of iterations of the algorithm.
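
Illustrative usage (the data and parameter values are arbitrary):

    # Illustrative usage of WKMeans; data and parameter values are arbitrary.
    import numpy as np
    from fsfc.generic import WKMeans

    x = np.random.rand(100, 20)        # toy data: 100 samples, 20 features
    model = WKMeans(k=5, beta=3)
    labels = model.fit_predict(x)      # cluster labels
    x_reduced = model.transform(x)     # dataset reduced to the 5 selected features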

Methods

fit(x, *rest)
fit_predict(X[, y]) Performs clustering on X and returns cluster labels.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.

Methods for text data

FSFC contains implementations of several feature selection algorithms that work with text data. The main difference of these algorithms is that they accept SciPy sparse matrices as input; such matrices can be computed using the vectorizers from sklearn.

Every algorithm can be imported either from its own package or from the fsfc.text module.
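
For example, a TF-IDF vectorizer produces a sparse matrix in the expected format (a minimal sketch with a placeholder corpus):

    # Minimal sketch: sklearn vectorizers produce the SciPy sparse matrices
    # that FSFC's text algorithms expect. The corpus is a placeholder.
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["first toy document", "second toy document", "one more text"]
    x = TfidfVectorizer().fit_transform(texts)   # csr_matrix of shape [n_samples, n_terms]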

Chi-R algorithm

class fsfc.text.CHIR.CHIR(k, clusters, alpha=0.1, max_iter=1000)[source]

Bases: fsfc.base.KBestFeatureSelector

Chi-R feature selection algorithm for text clustering.

Uses the Chi-square statistic to evaluate the importance of each feature and an R-coefficient that normalises the statistic across the corpus.

Based on the article “Text clustering with feature selection by using statistical data.”.

The algorithm selects features in the following way:
  1. Finds an initial clustering of the dataset.
  2. Computes Chi-R scores for each feature according to the article.
  3. Selects the top k features according to the scores.
  4. Sets the weights of the top features to 1 and of the others to alpha.
  5. Recomputes the clustering. If it changes, repeats steps 2-5; otherwise returns the feature weights.
Parameters:
k: int

Number of features to select.

clusters: int

Expected number of clusters.

alpha: float (default 0.1)

Weight value assigned to irrelevant features.

max_iter: int (default 1000)

Maximal number of iterations of the algorithm.
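
Illustrative usage on a TF-IDF matrix (the corpus and parameter values are arbitrary):

    # Illustrative usage of CHIR; the corpus and parameter values are arbitrary.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from fsfc.text import CHIR

    texts = ["first toy document", "second toy document", "one more text", "yet another text"]
    x = TfidfVectorizer().fit_transform(texts)   # csr_matrix of shape [n_samples, n_terms]
    selector = CHIR(k=3, clusters=2)
    x_reduced = selector.fit_transform(x)        # keep the 3 best-scoring terms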

Methods

fit(x, *rest) Fit algorithm to dataset and select relevant features.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.
fit(x, *rest)[source]

Fit algorithm to dataset and select relevant features.

Parameters:
x: csr_matrix

SciPy sparse matrix representing the terms contained in every sample. It may be created by the vectorizers from sklearn; the preferred choice is the TF-IDF vectorizer.

Returns:
self: CHIR

Returns itself to support chaining.

Frequent Term-based Clustering

class fsfc.text.FTC.FTC(minsup)[source]

Bases: fsfc.base.ClusteringFeatureSelector

Frequent Term-based Clustering algorithm. Uses frequent termsets to find clusters and simultaneously selects the features that determine every cluster.

Based on the article “Frequent term-based text clustering”.

An FTS (frequent termset) is a set of terms that appears in some fraction of all samples in the dataset. We say that an FTS covers a sample if every term of the FTS is contained in the sample.

The algorithm clusters the data in the following way:
  1. Finds all FTSs for the dataset with the specified minsup. Elements of an FTS are terms, i.e. features of the dataset.
  2. Finds the FTS that has the lowest Entropy Overlap with the remaining clusters with respect to the dataset. The paper shows that such an FTS explains the data best.
  3. Adds this FTS to the clustering, removes the samples covered by it from the dataset, and repeats steps 2 and 3.
  4. Assigns clusters to samples: a sample belongs to the cluster defined by an FTS if that FTS covers the sample.
  5. The score of a feature is 1 if the feature belongs to any FTS and 0 otherwise.
Parameters:
minsup: float

Fraction of the dataset that should be covered by each FTS.
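
Illustrative usage (the corpus and the minsup value are arbitrary):

    # Illustrative usage of FTC; the corpus and minsup value are arbitrary.
    from sklearn.feature_extraction.text import CountVectorizer
    from fsfc.text import FTC

    texts = ["first toy document", "second toy document", "one more text", "yet another text"]
    x = CountVectorizer().fit_transform(texts)   # csr_matrix of shape [n_samples, n_terms]
    model = FTC(minsup=0.3)
    labels = model.fit_predict(x)                # cluster labels defined by the found FTSs
    x_reduced = model.transform(x)               # terms that belong to any found FTS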

Methods

fit(x, *rest) Fit algorithm to dataset, find clusters and select relevant features.
fit_predict(X[, y]) Performs clustering on X and returns cluster labels.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected
inverse_transform(X) Reverse the transformation operation
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.
fit(x, *rest)[source]

Fit algorithm to dataset, find clusters and select relevant features.

Parameters:
x: csr_matrix

SciPy sparse matrix representing the terms contained in every sample. It may be created by the vectorizers from sklearn.

Returns:
self: FTC

Returns itself to support chaining.

Utility Algorithms

FSFC also contains some utility algorithms. They may be imported from the algorithm's own module or directly from the fsfc.utils module.

Apriori algorithm

fsfc.utils.apriori.apriori(dataset, minspan)[source]

Apriori algorithm by Rakesh Agrawal and Ramakrishnan Srikant

Finds all frequent itemsets in the dataset with the specified minspan. An itemset is a set of elements that appears in no less than a minspan fraction of the dataset.

Based on the article “Fast algorithms for mining association rules.”.

Parameters:
dataset: list

List of size n_samples whose elements are sets of integers. Each set represents a sample from the dataset.

minspan: float

MinSpan value. The algorithm selects sets of items that appear in no less than (MinSpan * n_samples) samples.

Returns:
itemsets: list

List of frequent itemsets. Every itemset is a list of integers in increasing order.
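
Illustrative usage (the dataset and the minspan value are arbitrary):

    # Illustrative usage of the apriori utility; values are arbitrary.
    from fsfc.utils import apriori

    dataset = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}]   # each set of integers is one sample
    itemsets = apriori(dataset, 0.5)                # itemsets appearing in at least 50% of samples
    print(itemsets)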
