chemfp.arena module
Algorithms and data structures for working with a FingerprintArena.
This is an internal chemfp module. It should not be imported by programs which use the public API. (Let me know if anything else should be part of the public API.)
This module contains class definitions for objects which are returned as part of the public API. A FingerprintArena stores fingerprints in a contiguous block of memory, along with their associated ids. A FingerprintList provides a list-like view of the fingerprints.
-
class
chemfp.arena.
FingerprintArena
(metadata, alignment, start_padding, end_padding, storage_size, arena, popcount_indices, arena_ids, start=0, end=None, id_lookup=None, num_bits=None, num_bytes=None, license_key=b'')¶ Bases:
chemfp.FingerprintReader
Store fingerprints in a contiguous block of memory for fast searches
A FingerprintArena implements the chemfp.FingerprintReader API. The fingerprints are stored in a contiguous block of memory, so the per-molecule overhead is very low. The block is named arena. The first fingerprint starts at the offset start_padding and each fingerprint takes storage_size bytes, so fingerprint i is located at:
self.arena[self.start_padding + i * self.storage_size : self.start_padding + (i+1) * self.storage_size]
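The offset arithmetic above can be illustrated with a toy byte block. This is only a sketch: the values and the helper names stored_fingerprint and fingerprint_bytes are made up for illustration and are not part of chemfp.

```python
# Toy arena: 2 bytes of start padding, then three 2-byte fingerprints,
# each stored in 4 bytes (2 data bytes followed by 2 zero padding bytes).
start_padding = 2
storage_size = 4
num_bytes = 2
arena = (b"\x00\x00"
         + b"\x01\x02\x00\x00"
         + b"\x03\x04\x00\x00"
         + b"\x05\x06\x00\x00")

def stored_fingerprint(i):
    """Return the storage_size bytes which hold fingerprint i, padding included."""
    begin = start_padding + i * storage_size
    return arena[begin:begin + storage_size]

def fingerprint_bytes(i):
    """Return only the num_bytes data bytes of fingerprint i."""
    return stored_fingerprint(i)[:num_bytes]
```

Here stored_fingerprint(1) covers bytes 6 through 9 of the block, and fingerprint_bytes(1) drops the trailing padding.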
The fingerprints can be sorted by popcount, so the fingerprints with no bits set come first, followed by those with 1 bit, etc. If self.popcount_indices is a non-empty string then the string contains information about the start and end offsets for all the fingerprints with a given popcount. This information is used for the BitBound search algorithm.
The public attributes are:
- metadata - a chemfp.Metadata with information about the fingerprints
- ids - list of identifiers, in index order
- fingerprints - a FingerprintList list-like view of the fingerprints, in index order
Other attributes, which might be subject to change, and which I won’t fully explain, are:
- arena - a contiguous block of memory, which contains the fingerprints
- start_padding - number of bytes to the first fingerprint in the block
- end_padding - number of bytes after the last fingerprint in the block
- storage_size - number of bytes used to store a fingerprint
- num_bytes - number of bytes in each fingerprint (must be <= storage_size)
- num_bits - number of bits in each fingerprint
- alignment - the fingerprint alignment
- start - the index for the first fingerprint in the arena/subarena
- end - one more than the index of the last fingerprint in the arena/subarena
- arena_ids - all of the identifiers for the parent arena
The FingerprintArena is its own context manager, but it does nothing on context exit. The derived FPBFingerprintArena may use a memory-mapped FPB file, which will be closed by the context manager or by an explicit call to close().
-
alignment
= None¶ the fingerprint alignment
-
arena
= None¶ a contiguous block of memory, which contains the fingerprints.
-
arena_ids
= None¶ list of identifiers for the parent arena. You likely want to use
ids
, which contains the ids for this arena.
-
close
()¶ Close any resources associated with this arena
If the arena uses a memory-mapped file (eg, an FPB file) then this will close the file.
-
copy
(indices=None, reorder=None, metadata=None, ids=None)¶ Create a new arena using either all or some of the fingerprints in this arena
By default this creates a new arena. The fingerprint data block and ids may be shared with the original arena, which makes this a shallow copy. If the original arena is a slice, or “sub-arena” of an arena, then the copy will allocate new space to store just the fingerprints in the slice and use its own list for the ids.
The indices parameter, if not None, is an iterable which contains the indices of the fingerprint records to copy. Duplicates are allowed, though discouraged.
If indices are specified then the default reorder value of None, or the value True, will reorder the fingerprints for the new arena by popcount. This improves overall search performance. If reorder is False then the new arena will preserve the order given by the indices.
If indices are not specified, then the default is to preserve the order type of the original arena. Use reorder=True to always reorder the fingerprints in the new arena by popcount, and reorder=False to always leave them in the current ordering.
>>> import chemfp
>>> arena = chemfp.load_fingerprints("pubchem_queries.fps")
>>> arena.ids[1], arena.ids[5], arena.ids[10], arena.ids[18]
(b'9425031', b'9425015', b'9425040', b'9425033')
>>> len(arena)
19
>>> new_arena = arena.copy(indices=[1, 5, 10, 18])
>>> len(new_arena)
4
>>> new_arena.ids
[b'9425031', b'9425015', b'9425040', b'9425033']
>>> new_arena = arena.copy(indices=[18, 10, 5, 1], reorder=False)
>>> new_arena.ids
[b'9425033', b'9425040', b'9425015', b'9425031']
If metadata is not None then it will be the metadata of the new copy.
Use ids to specify the identifiers for the new copy. This is especially useful as a way to preserve the initial fingerprint index in the original arena.
Parameters:
- indices (iterable containing integers, or None) – indices of the records to copy into the new arena
- reorder (True to reorder, False to leave in input order, None for default action) – describes how to order the fingerprints
- metadata (a chemfp.Metadata or None) – the metadata to use in the new copy
- ids (a list of values, or None to keep the original identifiers) – replacement identifiers to use in the copy
Returns: a new FingerprintArena
-
count_tanimoto_hits_arena
(queries, threshold=0.7)¶ Count the fingerprints which are sufficiently similar to each query fingerprint
Returns a list containing a count for each query fingerprint in the queries arena. The count is the number of fingerprints in the arena which are at least threshold similar to the query fingerprint.
The order of results is the same as the order of the queries.
Parameters:
- queries (a FingerprintArena) – query fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: list of integer counts, one for each query
-
count_tanimoto_hits_fp
(query_fp, threshold=0.7)¶ Count the fingerprints which are sufficiently similar to the query fingerprint
Return the number of fingerprints in the arena which are at least threshold similar to the query fingerprint query_fp.
Parameters:
- query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: integer count
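To make the meaning of the count concrete, here is a brute-force sketch in plain Python. It is not chemfp's implementation, which works on the arena's memory block and can use the popcount ordering. The function names are hypothetical, and treating the Tanimoto of two all-zero fingerprints as 0.0 is an assumption of this sketch.

```python
def popcount(fp):
    """Number of on bits in a byte string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)

def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte string fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = popcount(fp1) + popcount(fp2) - both
    # Assumption: two fingerprints with no bits set score 0.0.
    return both / either if either else 0.0

def count_hits(query_fp, target_fps, threshold=0.7):
    """Count the targets which are at least threshold similar to the query."""
    return sum(1 for fp in target_fps if tanimoto(query_fp, fp) >= threshold)
```

For example, with the query b"\x0f" and the targets b"\x0f", b"\x07" and b"\xf0", the scores are 1.0, 0.75 and 0.0, so two targets pass the default threshold of 0.7.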
-
count_tversky_hits_fp
(query_fp, threshold=0.7, alpha=1.0, beta=1.0)¶ Count the fingerprints which are sufficiently similar to the query fingerprint
Return the number of fingerprints in the arena which are at least threshold similar to the query fingerprint query_fp.
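Parameters:
- query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
- alpha (float) – Tversky alpha parameter (default: 1.0)
- beta (float) – Tversky beta parameter (default: 1.0)
Returns: integer count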
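For readers unfamiliar with the alpha and beta weights, the following sketch computes the standard Tversky index for two byte string fingerprints. The function name is hypothetical, and the handling of an empty denominator is an assumption. It is not taken from chemfp.

```python
def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Standard Tversky index: c / (c + alpha*only_query + beta*only_target)."""
    common = only_query = only_target = 0
    for q, t in zip(query_fp, target_fp):
        common += bin(q & t).count("1")
        only_query += bin(q & ~t & 0xFF).count("1")   # bits set only in the query
        only_target += bin(t & ~q & 0xFF).count("1")  # bits set only in the target
    denominator = common + alpha * only_query + beta * only_target
    # Assumption: an empty denominator scores 0.0.
    return common / denominator if denominator else 0.0
```

With alpha = beta = 1.0 this is the Tanimoto similarity, and with alpha = beta = 0.5 it is the Dice similarity. Setting beta = 0.0 ignores bits which are set only in the target, which gives a substructure-like score.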
-
end
= None¶ if a subarena, one more than the index of the last fingerprint relative to the start of the parent arena. Will be the number of total fingerprints if this is not a subarena.
-
end_padding
= None¶ the number of bytes after the last fingerprint in the block
-
fingerprints
= None¶ list-like view of the fingerprints
-
get_bit_counts
()¶ Count the number of on bits for each position in the fingerprint
This function returns an array.array of length num_bits integers. Use get_bit_counts_as_numpy() to return a NumPy array.
Returns: an array.array of length num_bits with 4-byte signed integers
-
get_bit_counts_as_numpy
()¶ Count the number of on bits for each position in the fingerprint
This function returns a NumPy array of length num_bits integers. Use get_bit_counts() to return an array.array.
Returns: a NumPy array of length num_bits and type int32
-
get_by_id
(id)¶ Given the record identifier, return the (id, fingerprint) pair.
If the id is not present then return None.
-
get_fingerprint
(i)¶ Return the fingerprint at index i
Raises an IndexError if index i is out of range.
-
get_fingerprint_by_id
(id)¶ Given the record identifier, return its fingerprint
If the id is not present then return None.
-
get_index_by_id
(id)¶ Given the record identifier, return the record index.
If the id is not present then return None.
-
ids
¶ Return the identifiers in this arena or subarena.
-
iter_arenas
(arena_size=1000)¶ Base class for all chemfp objects holding fingerprint records
All FingerprintReader instances have a metadata attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record.
-
knearest_tanimoto_search_arena
(queries, k=3, threshold=0.0)¶ Find the k-nearest fingerprints which are sufficiently similar to each of the query fingerprints
For each fingerprint in the queries arena, find the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResults, where the hits in each SearchResult are sorted by similarity score.
Parameters:
- queries (a FingerprintArena) – query fingerprints
- k (positive integer) – maximum number of neighbors to find
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.0)
Returns: a SearchResults
-
knearest_tanimoto_search_fp
(query_fp, k=3, threshold=0.0)¶ Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResult, sorted from highest score to lowest.
Parameters:
- query_fp (byte string) – query fingerprint
- k (positive integer) – maximum number of neighbors to find
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.0)
Returns: a SearchResult
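The "filter by threshold, then keep the top k" behavior can be sketched in plain Python with heapq. This illustrates the semantics only and is not how chemfp performs the search. The function names and the list-of-pairs input are assumptions of this sketch.

```python
import heapq

def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte string fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0

def knearest(query_fp, targets, k=3, threshold=0.0):
    """Return up to k (id, score) pairs, sorted from highest score to lowest.

    targets is a list of (id, fingerprint) pairs.
    """
    scored = ((id, tanimoto(query_fp, fp)) for id, fp in targets)
    passing = [(id, score) for id, score in scored if score >= threshold]
    # nlargest returns its results ordered from largest to smallest key.
    return heapq.nlargest(k, passing, key=lambda hit: hit[1])
```

Note that k is an upper bound: if fewer than k fingerprints pass the threshold then fewer than k hits are returned.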
-
knearest_tversky_search_fp
(query_fp, k=3, threshold=0.0, alpha=1.0, beta=1.0)¶ Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResult, sorted from highest score to lowest.
Parameters:
- query_fp (byte string) – query fingerprint
- k (positive integer) – maximum number of neighbors to find
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.0)
- alpha (float) – Tversky alpha parameter (default: 1.0)
- beta (float) – Tversky beta parameter (default: 1.0)
Returns: a SearchResult
-
metadata
= None¶ a
chemfp.Metadata
with information about the fingerprints.
-
num_bits
= None¶ the number of bits in each fingerprint
-
num_bytes
= None¶ the number of bytes in each fingerprint (must be <=
storage_size
)
-
popcount_indices
= None¶ encoded byte string containing the fingerprint index for the first fingerprint with a given popcount p.
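The reason a popcount index speeds up a Tanimoto search is the bound Tanimoto(A, B) <= min(a, b) / max(a, b), where a and b are the popcounts. A fingerprint can only reach a threshold t against a query with popcount q if its popcount lies between q*t and q/t. The sketch below shows that idea with a plain Python list of start indices. The actual encoding of popcount_indices is not described here, so the starts list and both function names are assumptions.

```python
import math

def popcount_range(query_popcount, threshold, num_bits):
    """Smallest and largest target popcounts which could reach the threshold."""
    if threshold <= 0.0:
        return 0, num_bits
    # A production implementation would need to guard against
    # floating point rounding in these two expressions.
    low = math.ceil(query_popcount * threshold)
    high = min(num_bits, math.floor(query_popcount / threshold))
    return low, high

def candidate_slice(starts, query_popcount, threshold):
    """Index range of candidate fingerprints in a popcount-sorted arena.

    starts[p] is the index of the first fingerprint with popcount p, and
    starts has num_bits + 2 entries so starts[num_bits + 1] is the total count.
    """
    num_bits = len(starts) - 2
    low, high = popcount_range(query_popcount, threshold, num_bits)
    return starts[low], starts[high + 1]
```

Only the fingerprints inside the returned index range need to be compared with the query; everything outside it is guaranteed to score below the threshold.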
-
random_choice
(rng=None)¶ return a randomly selected (id, fp) pair
If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().
Parameters: rng (None, int, or a random.Random()) – method to use for random sampling
Returns: a 2-element tuple of identifier string and fingerprint bytes
-
sample
(num_samples, reorder=True, rng=None)¶ return a new arena containing num_samples randomly selected fingerprints, without replacement
If num_samples is an integer then it must be between 0 and the size of the arena. If num_samples is a float then it must be between 0.0 and 1.0 and is interpreted as the proportion of the arena to include.
By default the new arena is sorted by popcount. Set reorder to False to return the fingerprints in random order.
If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().
Parameters:
- num_samples (int or float) – number of fingerprints to select
- reorder (True to reorder, False to leave in the sampling order) – describes how to order the sampled fingerprints
- rng (None, int, or a random.Random()) – method to use for random sampling
Returns: a new FingerprintArena
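The three-way rng convention described above can be sketched with the standard library. The helper names resolve_rng and sample_indices are hypothetical and only illustrate the dispatch. They are not chemfp functions.

```python
import random

def resolve_rng(rng=None):
    """Map the rng argument to an object which has a sample() method."""
    if rng is None:
        return random               # the module-level random.sample()
    if isinstance(rng, int):
        return random.Random(rng)   # a new generator seeded with the integer
    return rng                      # assume it already provides sample()

def sample_indices(n, num_samples, rng=None):
    """Pick num_samples distinct indices from range(n), in sampling order."""
    return resolve_rng(rng).sample(range(n), num_samples)
```

Passing an integer seed makes the selection reproducible, which is useful when the same subset must be generated again later.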
-
save
(destination, format=None, level=None)¶ Save the fingerprints to a given destination and format
The output format is based on the format parameter. If the format is None then the format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.
If the output format is “fps”, “fps.gz”, or “fps.zst” then destination may be a filename, a file object, or None; None writes to stdout.
If the output format is “fpb” then destination must be a filename or seekable file object. Chemfp cannot save to compressed FPB files.
Parameters:
- destination (a filename, file object, or None) – the output destination
- format (None, "fps", "fps.gz", "fps.zst", or "fpb") – the output format
- level (an integer, or "min", "default", or "max" for compressor-specific values) – compression level when writing .gz or .zst files
Returns: None
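The format selection rules can be summarized with a small helper. This is a sketch of the documented behavior only: infer_format is a hypothetical name, and how chemfp inspects file objects or unusual extensions is not covered here.

```python
def infer_format(destination, format=None):
    """Pick an output format name following the rules described above."""
    if format is not None:
        return format                      # an explicit format always wins
    if isinstance(destination, str):
        name = destination.lower()
        # Check the compressed extensions before the plain ".fps".
        for extension in (".fps.gz", ".fps.zst", ".fps", ".fpb"):
            if name.endswith(extension):
                return extension[1:]
    return "fps"                           # unrecognized or missing extension
```

In this sketch an explicit format overrides the extension, so infer_format("out.fpb", "fps") gives "fps".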
-
start_padding
= None¶ the number of bytes before the first fingerprint in the block
-
storage_size
= None¶ the number of bytes used to store a fingerprint
-
threshold_tanimoto_search_arena
(queries, threshold=0.7)¶ Find the fingerprints which are sufficiently similar to each of the query fingerprints
For each fingerprint in the queries arena, find all of the fingerprints in this arena which are at least threshold similar. The hits are returned as a SearchResults, where the hits in each SearchResult are in arbitrary order.
Parameters:
- queries (a FingerprintArena) – query fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResults
-
threshold_tanimoto_search_fp
(query_fp, threshold=0.7)¶ Find the fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a SearchResult, in arbitrary order.
Parameters:
- query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResult
-
threshold_tversky_search_fp
(query_fp, threshold=0.7, alpha=1.0, beta=1.0)¶ Find the fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a SearchResult, in arbitrary order.
Parameters:
- query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
- alpha (float) – Tversky alpha parameter (default: 1.0)
- beta (float) – Tversky beta parameter (default: 1.0)
Returns: a SearchResult
-
to_numpy_array
()¶ Get the fingerprint bytes in a chemfp arena as a NumPy uint8 array.
A chemfp arena stores fingerprints in a contiguous byte string. This function returns a 2D NumPy array which is a view of that string. The array has len(arena) rows and arena.storage_size columns.
The storage size may be larger than the minimum number of bytes in the fingerprint because of zero padding used to improve performance. For example, the 166-bit MACCS keys use 24 bytes of storage when only 21 bytes are needed, because then chemfp can use the fast POPCNT instruction when computing the Tanimoto.
To remove extra padding bytes, use NumPy indexing to copy the fingerprint bytes to a new array:
arr[:,0:arena.num_bytes]
The last column of this new array may contain padding bits if the number of bits in a fingerprint is not a multiple of 8.
Warning
Do not attempt to access the contents of a NumPy view of a FPBFingerprintArena (the arena from an FPB file) after the FPB file has been closed as that will likely cause a segmentation fault or other severe failure.
Returns: a NumPy array of type uint8
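The MACCS numbers above follow from simple rounding arithmetic, sketched below. The 8-byte alignment is an assumption chosen because it reproduces the 21-to-24 byte example; the actual alignment is stored in the arena's alignment attribute. The function names are hypothetical.

```python
def min_num_bytes(num_bits):
    """Smallest number of bytes which can hold num_bits bits."""
    return (num_bits + 7) // 8

def padded_storage_size(num_bytes, alignment=8):
    """Round num_bytes up to the next multiple of alignment."""
    return -(-num_bytes // alignment) * alignment

def num_padding_bits(num_bits):
    """Unused bits in the last data byte of a fingerprint."""
    return min_num_bytes(num_bits) * 8 - num_bits
```

For the 166-bit MACCS keys this gives 21 data bytes, 24 bytes of storage, and 2 unused bits in the last data byte.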
-
to_numpy_bitarray
(bitlist=None)¶ Get the fingerprint bits in a chemfp arena as a NumPy uint8 array.
This function returns a 2D NumPy array with len(arena) rows and one column for each bit. The default returns arena.num_bits columns, where column 0 is the first bit, etc. Use bitlist to specify the indices of which columns to return. Negative indices are supported; -1 is the last bit, -2 is the second to last. Out of range indices raise an IndexError.
Parameters: bitlist (iterable of integers) – bit column indices to use (default: all bits)
Returns: a NumPy array of type uint8
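The bitlist semantics, including negative indices and the IndexError, can be sketched for a single fingerprint in plain Python. The function name is hypothetical, and the bit numbering (bit 0 is the least significant bit of the first byte) is an assumption of this sketch.

```python
def unpack_bits(fp, num_bits, bitlist=None):
    """Return the selected bits of one fingerprint as a list of 0/1 values."""
    if bitlist is None:
        bitlist = range(num_bits)
    row = []
    for bitno in bitlist:
        if not -num_bits <= bitno < num_bits:
            raise IndexError("bit index %d is out of range" % (bitno,))
        bitno %= num_bits  # map negative indices onto 0..num_bits-1
        # Assumption: bit 0 is the least significant bit of byte 0.
        row.append((fp[bitno // 8] >> (bitno % 8)) & 1)
    return row
```

Applying this to every fingerprint in an arena would give the rows of the 2D array, one column per entry in bitlist.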
-
train_test_split
(train_size=None, test_size=None, reorder=True, rng=None)¶ return arenas containing train_size and test_size randomly selected fingerprints, without replacement
If train_size is an integer then it must be between 0 and the size of the arena. If train_size is a float then it must be between 0.0 and 1.0 and is interpreted as the proportion of the arena to include. If train_size is None then it is set to the complement of test_size. If both train_size and test_size are None then the default train_size is 0.75.
If test_size is an integer then it must be between 0 and the size of the arena. If test_size is a float then it must be between 0.0 and 1.0 and is interpreted as the proportion of the arena to include. If test_size is None then it is set to the complement of train_size. If both test_size and train_size are None then the default test_size is 0.25.
By default the new arenas are sorted by popcount. Set reorder to False to return the fingerprints in random order.
If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().
The API of this method is modelled on scikit-learn’s model_selection.train_test_split() function, described at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Parameters:
- train_size (int, float, or None) – number of fingerprints for the training set arena
- test_size (int, float, or None) – number of fingerprints for the test set arena
- reorder (True to reorder, False to leave in the sampling order) – describes how to order the sampled fingerprints
- rng (None, int, or a random.Random()) – method to use for random sampling
Returns: a training set FingerprintArena and a test set FingerprintArena
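The defaulting rules for train_size and test_size can be written out as a short function. This sketch only resolves the None cases as described above. How a float proportion is converted to a fingerprint count is not specified here, so proportions are left as floats. The function name is hypothetical.

```python
def resolve_split_sizes(n, train_size=None, test_size=None):
    """Fill in whichever of train_size and test_size is None.

    n is the number of fingerprints in the arena. Integers are counts
    and floats are proportions of the arena.
    """
    def complement(size):
        return n - size if isinstance(size, int) else 1.0 - size

    if train_size is None and test_size is None:
        return 0.75, 0.25
    if train_size is None:
        return complement(test_size), test_size
    if test_size is None:
        return train_size, complement(train_size)
    return train_size, test_size
```

For a 100-fingerprint arena, train_size=80 resolves to a test size of 20, and test_size=0.25 resolves to a training proportion of 0.75.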
-
class
chemfp.arena.
FingerprintList
(start_padding, storage_size, arena, start, end, num_bytes)¶ Bases:
collections.abc.Sequence
A read-only list-like view of the arena fingerprints
This implements the standard Python list API, including indexing and iteration.
Note: fingerprint searches like “fp in fingerprint_list” and “fingerprint_list.index(fp)” are not fast.
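The note about slow membership tests follows from how collections.abc.Sequence works: a subclass only needs to define __len__ and __getitem__, and the inherited "in" and index() then scan the items one by one. The class below is a minimal sketch of such a view over a byte block. It is a hypothetical illustration, not the chemfp FingerprintList.

```python
from collections.abc import Sequence

class FingerprintView(Sequence):
    """Read-only view over fingerprints packed in a byte block."""

    def __init__(self, arena, start_padding, storage_size, num_bytes, count):
        self._arena = arena
        self._start_padding = start_padding
        self._storage_size = storage_size
        self._num_bytes = num_bytes
        self._count = count

    def __len__(self):
        return self._count

    def __getitem__(self, i):
        if isinstance(i, slice):
            return [self[j] for j in range(*i.indices(self._count))]
        if i < 0:
            i += self._count
        if not 0 <= i < self._count:
            raise IndexError("fingerprint index out of range")
        begin = self._start_padding + i * self._storage_size
        return self._arena[begin:begin + self._num_bytes]
```

Iteration, "in", index() and count() all come from the Sequence mixin methods, and each of them is a linear scan over the fingerprints.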
-
random_choice
(rng=None)¶ Return a randomly selected fingerprint.
If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().
Parameters: rng (None, int, or a random.Random()) – method to use for random sampling
Returns: a 2-element tuple of identifier string and fingerprint bytes
-