chemfp 4.1 documentation

chemfp is a set of command-line tools and a Python package for working with binary cheminformatics fingerprints, typically between several hundred and a few thousand bits long.

This is the documentation for chemfp 4.1. It is a work-in-progress and currently only covers the major changes since chemfp 3.5. While chemfp 4.1 is nearly completely backwards compatible with chemfp 3.5, the recent additions change and (hopefully!) improve how to use the chemfp API.

Chemfp 4.1 was released on 17 May 2023 with CXSMILES support, methods to save and load similarity search results to a SciPy sparse matrix, Butina clustering, methods to work with CSV files, and tools to convert between structure formats.

It was tested with Python 3.8, 3.9, 3.10, and 3.11. It requires the “click” and “tqdm” third-party packages, which should be installed automatically as part of the normal installation process. Some optional features, like the NumPy, SciPy, and Pandas integration, only work if those packages are installed separately.

Chemfp 4.0 added new methods for diversity selection and improved API usability with new high-level functions and better feedback for interactive use (including progress bars!).

For details on the older parts of the API, please also consult the chemfp 3.5 documentation.

Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit. The supported toolkits are OEChem/OEGraphSim, Open Babel, RDKit and CDK (via the JPype adapter).


Background

Chemfp started because around 2007 a project I worked on needed a way to include nearest-neighbor information for a property prediction calculator. The cheminformatics toolkits at the time didn’t include that as a built-in tool, though they did supply the components to build your own. Indeed, asking showed that nearly everyone had built their own similar sorts of tools, each with a different format, and varying levels of performance that were nowhere near the hardware limits.

The first step was to develop the FPS format, a human-readable text-based exchange format for fingerprint data that is easy for software to read and write. It stores fingerprint records containing a hex-encoded fingerprint and a record identifier, as well as metadata like the associated fingerprint type.
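As a rough illustration of that layout, here is a minimal sketch of a reader for FPS-style text. It assumes header lines start with "#" and hold "key=value" metadata, and that each record line is a hex-encoded fingerprint, a tab, and an identifier. The function name parse_fps, the sample identifiers, and the "Example-Fingerprint/1" type string are made up for this example; this is not chemfp's own reader, which handles many details this sketch ignores.

```python
import io


def parse_fps(text):
    """Parse FPS-style text into (metadata dict, list of (id, fingerprint bytes))."""
    metadata = {}
    records = []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Assumed header layout: "#key=value". A version line such as
            # "#FPS1" has no "=" and is skipped here.
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key] = value
            continue
        # Assumed record layout: hex fingerprint, tab, identifier.
        hex_fp, record_id = line.split("\t")[:2]
        records.append((record_id, bytes.fromhex(hex_fp)))
    return metadata, records


# Hypothetical 32-bit fingerprints, for illustration only.
example = (
    "#FPS1\n"
    "#num_bits=32\n"
    "#type=Example-Fingerprint/1\n"
    "0a1b2c3d\tmol-1\n"
    "ffff0000\tmol-2\n"
)
metadata, records = parse_fps(example)
```

Because the format is line-oriented text, a reader like this needs nothing beyond the standard library, which is part of what makes it practical as an exchange format.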

People don’t use an alternate format just because it exists, so the next step was to develop useful command-line tools for fingerprint generation and similarity search, as well as a Python API for working with fingerprints in a discovery setting, like adding similarity search to a web app! Alternatively, when fingerprints are already stored in an SD file, the sdf2fps tool can extract the fingerprint data from an SDF tag field.

People don’t use alternative tools just because they exist, so the third step was to improve similarity search performance. This was done by improving the search algorithm and implementation and adding multi-threaded support for NxN or NxM search cases. Similarity search with modern chemfp is over 10x faster than chemfp 1.0!
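To make the search task concrete, here is a deliberately naive sketch of a threshold similarity search over byte-string fingerprints. It assumes Tanimoto similarity, the measure most commonly used with binary fingerprints; the text above does not name one. The function names and sample data are hypothetical, and this linear scan is the slow baseline, not the optimized algorithm chemfp implements.

```python
def popcount(data):
    """Count the number of on-bits in a byte string."""
    return bin(int.from_bytes(data, "big")).count("1")


def tanimoto(fp1, fp2):
    """Tanimoto similarity: shared on-bits divided by on-bits in either."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    both = popcount(bytes(a & b for a, b in zip(fp1, fp2)))
    either = popcount(bytes(a | b for a, b in zip(fp1, fp2)))
    # Convention assumed here: two empty fingerprints score 0.0.
    return both / either if either else 0.0


def threshold_search(query, targets, threshold=0.7):
    """Return (id, score) pairs scoring at least `threshold`, best first."""
    hits = []
    for target_id, target_fp in targets:
        score = tanimoto(query, target_fp)
        if score >= threshold:
            hits.append((target_id, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Hypothetical 8-bit fingerprints, for illustration only.
targets = [("a", b"\x0f"), ("b", b"\x03"), ("c", b"\xf0")]
hits = threshold_search(b"\x0f", targets, threshold=0.5)
```

Every query here touches every target, one byte at a time in Python. Much of the gap between a scan like this and the hardware limits comes from the per-comparison overhead and from comparing targets that a smarter search could rule out early.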

Similarity search is fast enough that for many cases the FPS read performance became the limiting factor. This is especially noticeable in web development, where modern practices restart the web app after every change. The FPB binary format was developed as a way to quickly load a fingerprint dataset. Its internal layout is identical to what’s needed for a similarity search, so the load step needs little additional processing. The fpcat program converts between the FPS and FPB formats.

Chemfp supports four different cheminformatics toolkits, which are used for molecule I/O and fingerprint generation. One of the goals of the chemfp API is to make it easy to work with fingerprints from different toolkits without learning the details of each toolkit. In the usual computer science fashion, this is done with the “toolkit wrapper API”, which gives a consistent API across the supported toolkits.

The “text toolkit” implements a subset of this API, to work with SDF and SMILES files as text records. The text toolkit also includes special support for working with SDF files, for example, to add fingerprint data as tag data to an SDF record without round-tripping the record through a chemistry toolkit.

With the 4.0 release, chemfp added support for diversity selection, including MaxMin, sphere exclusion, and heapsweep. Rather than add a new command-line program for each new tool, the “chemfp” command-line tool was added, with subcommands for each tool. The 4.1 release added the “butina”, “csv2fps” and “translate” subcommands, along with Python API additions for clustering and CSV processing.
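For readers unfamiliar with MaxMin, the idea can be sketched in a few lines: starting from a seed, repeatedly pick the candidate whose highest similarity to anything already picked is lowest. The sketch below assumes Tanimoto similarity on byte-string fingerprints; the function names, the seed choice, and the sample data are hypothetical. It illustrates the general technique only and does not reflect chemfp's implementation or its API.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two equal-length byte-string fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


def maxmin_pick(fingerprints, num_picks, seed_index=0):
    """Return indices of up to `num_picks` mutually dissimilar fingerprints."""
    if num_picks <= 0 or not fingerprints:
        return []
    picks = [seed_index]
    # nearest[i] is the highest similarity between candidate i and any pick.
    nearest = [tanimoto(fp, fingerprints[seed_index]) for fp in fingerprints]
    while len(picks) < min(num_picks, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in picks]
        # Choose the candidate least similar to everything picked so far.
        best = min(candidates, key=lambda i: nearest[i])
        picks.append(best)
        # Update each candidate's nearest-pick similarity with the new pick.
        for i, fp in enumerate(fingerprints):
            nearest[i] = max(nearest[i], tanimoto(fp, fingerprints[best]))
    return picks


# Hypothetical 8-bit fingerprints, for illustration only.
fingerprints = [b"\x0f", b"\x0e", b"\xf0", b"\x3c"]
picks = maxmin_pick(fingerprints, 3)
```

Keeping the running `nearest` list means each new pick costs one pass over the dataset rather than a comparison against every earlier pick.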

Citation

For a different, more scholarly discussion of chemfp see “The chemfp project” in the Journal of Cheminformatics. That paper covers the purpose of the project, its architecture and design, the FPS and FPB file formats, and the experience in trying to run chemfp as a self-funded open source project.

To cite chemfp use: Dalke, A. The chemfp project. J Cheminform 11, 76 (2019). https://doi.org/10.1186/s13321-019-0398-8 .

Thanks

In no particular order, the following contributed to chemfp in some way: Noel O’Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, Lionel Uran Landaburu, Sereina Riniker, Brian Cole, John Mayfield, Jeff van Santen, and Jakub Gunera.

Thanks also to my wife, Sara Marie, for her many years of support.
