chemfp.highlevel.clustering module¶
This module should not be imported directly.
It contains internal implementation details of the high-level API available from the top-level chemfp module.
This module is included in the documentation because parts of this module are returned to the user, and are part of the public API.
-
class
chemfp.highlevel.clustering.
ButinaClusters
(arena, matrix, seed, NxN_threshold, butina_threshold, tiebreaker, false_singletons, num_butina_clusters, rescore, picker, result, times, _arena_close, fingerprints_filename, matrix_filename)¶ Bases:
object
The results of chemfp.butina() with query details, search results, and timing information.
The available properties are:
arena
- the fingerprint arena, based on the input fingerprints
matrix
- the NxN similarity matrix, based on the input matrix
seed
- the seed for the RNG
NxN_threshold
- the NxN similarity threshold
butina_threshold
- the Butina algorithm minimum similarity threshold
tiebreaker
- the specified tiebreaker method
false_singletons
- the specified method for handling false singletons
num_butina_clusters
- the specified maximum number of clusters, or None
rescore
- the flag value used to request that reassigned fingerprints be re-scored
picker
- the underlying Butina picker object
result
- the underlying Butina clustering results
times
- a breakdown of the times for the search as a dictionary mapping task to elapsed time in seconds, or None if it wasn’t relevant. “load_arena” and “load_matrix” are the times needed to load the arena and matrix, respectively, with “load” as the total load time. “NxN” is the time needed to compute the NxN matrix. The “cluster”, “prune”, and “rescore” times are self-explanatory. “total” is the total time for the butina call.
fingerprints_filename
- the value of fingerprints, if it is a string
matrix_filename
- the value of matrix, if it is a string
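As a rough illustration of the times dictionary described above, the following sketch builds a plain dict with the documented keys and formats the entries that were actually timed. The timing values are invented, summarize_times is a hypothetical helper (it is not get_times_description()), and the reading that individual entries may be None is an assumption.

```python
# Hypothetical timing values in seconds. None marks a task that was not
# relevant (assumption: None applies per task, not to the whole dictionary).
times = {
    "load_arena": 0.12,
    "load_matrix": None,  # e.g. no matrix file was loaded
    "load": 0.12,         # total load time
    "NxN": 3.40,
    "cluster": 0.25,
    "prune": None,
    "rescore": 0.05,
    "total": 3.82,
}


def summarize_times(times):
    """Return one 'task: seconds' line for each task that has a time."""
    return [
        f"{task}: {elapsed:.2f} s"
        for task, elapsed in times.items()
        if elapsed is not None
    ]


for line in summarize_times(times):
    print(line)
```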
-
all_clusters
¶ The full list of Butina clusters, ordered by cluster index.
The list may include empty clusters, which arise when false singletons are moved or when the number of clusters is pruned.
-
as_ctypes
()¶ Return the assignments as a ctypes array
-
as_numpy
()¶ Return the assignments as a NumPy array
-
assignments
¶ Return the assignments as a ButinaAssignments instance
-
close
()¶ Release any assigned resources, like a memory-mapped FPB arena
-
clusters
¶ The final list of clusters.
This list is ordered from largest to smallest.
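The difference between the two orderings can be pictured with plain Python lists. This is only a sketch: the member ids are invented, and it assumes that the final list omits empty clusters, which this documentation implies but does not state.

```python
# Hypothetical clusters in internal index order; each entry lists member ids.
# Index 1 is empty, e.g. because its false singleton moved to another cluster.
all_clusters = [
    ["a", "b"],            # index 0
    [],                    # index 1 (empty)
    ["c", "d", "e", "f"],  # index 2
    ["g"],                 # index 3
]

# Drop the empty clusters, then order from largest to smallest.
# sorted() is stable, so equal-sized clusters keep their index order here;
# how chemfp orders ties is not covered by this documentation.
clusters = sorted((c for c in all_clusters if c), key=len, reverse=True)

print([len(c) for c in clusters])
```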
-
get_description
(include_times=False)¶ Return a human-readable description of the Butina clustering
-
get_metadata
()¶ Return a dictionary containing entries for output metadata lines
-
get_times_description
()¶ Return a human-readable breakdown of the Butina compute times
-
get_type
()¶ Get the ‘type’ string describing the Butina search parameters
-
save
(destination=None, *, format=None, renumber=True, rename=True, include_members=True, metadata=None, include_metadata=True, precision=None)¶ Save the clusters to destination in one of several formats.
The supported formats are “centroid”, “flat”, “csv”, and “tsv”. If unspecified, infer the format from the destination filename extension. If the extension is not known, use “centroid”.
If renumber is True (the default) then the clusters are renumbered sequentially starting from 1. If False then use the internal cluster index, which starts from 0 and skips empty clusters.
If rename is True (the default) then rename the member types to either “CENTER” or “MEMBER”. If False, use the internal type names.
If include_members is True (the default) then include cluster members in the output.
If metadata is not None then it must be a dictionary used for the metadata lines. The keys and values must be encoded appropriately. (No tab, NUL, or newline character, and the key must not contain an equals sign.)
If include_metadata is True (the default) then include metadata information in the output file.
If precision is None then use the minimum number of decimal places needed to distinguish between two scores. This value depends on the number of bits in the fingerprint. Otherwise it must be an integer between 1 and 10, inclusive.
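The format selection and metadata rules for save() can be sketched in a few lines of standard-library Python. Both functions are hypothetical helpers, not part of chemfp. The sketch assumes that each extension matches its format name (for example ".csv" for "csv"), that metadata values are strings, and it ignores compressed filenames.

```python
import os

KNOWN_FORMATS = ("centroid", "flat", "csv", "tsv")


def infer_format(destination, format=None):
    """Use an explicit format if given, else the extension, else "centroid"."""
    if format is not None:
        if format not in KNOWN_FORMATS:
            raise ValueError(f"unsupported format: {format!r}")
        return format
    ext = os.path.splitext(str(destination))[1].lstrip(".").lower()
    # Assumption: the extension is the format name, e.g. ".tsv" -> "tsv".
    return ext if ext in KNOWN_FORMATS else "centroid"


def check_metadata(metadata):
    """Raise ValueError if a key or value breaks the stated encoding rules."""
    forbidden = ("\t", "\0", "\n")
    for key, value in metadata.items():
        for text in (key, value):
            if any(ch in text for ch in forbidden):
                raise ValueError(f"tab, NUL, or newline in {text!r}")
        if "=" in key:
            raise ValueError(f"equals sign in key {key!r}")


print(infer_format("clusters.csv"))  # extension is known
print(infer_format("clusters.txt"))  # unknown extension falls back
check_metadata({"source": "example"})  # passes silently
```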
-
to_pandas
(*, columns=['cluster', 'id', 'type', 'score'], rename=True, renumber=True, sort=True)¶ Return the assignments as a pandas DataFrame
The DataFrame contains one row for each input fingerprint and four columns:
- cluster is the cluster index
- id is the identifier from the input matrix
- type is a string like “CENTER” or “MEMBER”
- score is the Tanimoto score
Use columns to specify different column labels.
By default the assignment types are relabeled to use only “CENTER” and “MEMBER”. If rename is False then the full internal labels are used.
By default the cluster indices are renumbered to the contiguous values 1..N where N is the number of clusters. If renumber is False then the internal cluster indices are used, which start from 0 and may skip indices for empty clusters whose elements were moved to other clusters.
Parameters: - columns (a list of four strings) – column names for the returned DataFrame
- rename (bool) – if False use the internal type names rather than using only “CENTER” and “MEMBER”
- renumber (bool) – if False use the internal cluster ids
Returns: a pandas DataFrame
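The effect of renumber and rename can be pictured on plain tuples, without pandas. This is a sketch under stated assumptions: the rows and the "MEMBER_MOVED" type name are invented, new numbers are assigned in internal index order, and every type other than "CENTER" is mapped to "MEMBER". None of these details is confirmed by this documentation.

```python
# Hypothetical assignment rows: (internal cluster index, id, type, score).
# Internal index 1 does not appear because that cluster ended up empty.
rows = [
    (0, "mol1", "CENTER", 1.0),
    (0, "mol2", "MEMBER", 0.82),
    (2, "mol3", "CENTER", 1.0),
    (2, "mol4", "MEMBER_MOVED", 0.71),  # made-up internal type name
]


def renumber(rows):
    """Map the internal indices in use to contiguous values 1..N."""
    used = sorted({cluster for cluster, _, _, _ in rows})
    new_index = {old: new for new, old in enumerate(used, start=1)}
    return [(new_index[c], id_, type_, score) for c, id_, type_, score in rows]


def rename(rows):
    """Collapse every type other than "CENTER" to "MEMBER" (assumption)."""
    return [
        (c, id_, type_ if type_ == "CENTER" else "MEMBER", score)
        for c, id_, type_, score in rows
    ]


for row in rename(renumber(rows)):
    print(row)
```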