Chainer Chemistry: Chainer extension library for Biology and Chemistry¶
Chainer Chemistry is a collection of tools to train and run neural networks for tasks in biology and chemistry using Chainer.
Features¶
- State-of-the-art deep learning neural network models (especially graph convolutions) for chemical molecules (NFP, GGNN, Weave, SchNet etc.)
- Preprocessors of molecules tailored for these models
- Parsers for several standard file formats (CSV, SDF etc.)
- Loaders for several well-known datasets (QM9, Tox21 etc.)
An introduction to deep learning for molecules and Chainer Chemistry is also available here (SlideShare).
Installation¶
Dependency¶
The following packages are required to install Chainer Chemistry and are installed automatically when you install the library with the pip command.
It also uses the following library, which you need to install manually.
See the official documentation for installation instructions.
If you have set up Anaconda, you can install rdkit with the following command:
$ conda install -c rdkit rdkit
Install from source¶
The tarball of the source tree is available via pip download chainer-chemistry.
You can use setup.py to install Chainer Chemistry from the tarball:
$ tar zxf chainer-chemistry-x.x.x.tar.gz
$ cd chainer-chemistry-x.x.x
$ python setup.py install
To install the latest source from the master branch:
$ git clone https://github.com/pfnet-research/chainer-chemistry.git
$ pip install -e chainer-chemistry
Run example training code¶
The official repository provides examples of training several graph convolution networks. The code can be obtained by cloning the repository:
$ git clone https://github.com/pfnet-research/chainer-chemistry.git
The following commands train Neural Fingerprint (NFP) on the Tox21 dataset on CPU:
$ cd chainer-chemistry/examples/tox21
$ python train_tox21.py --method=nfp --gpu=-1 # set --gpu=0 if you have GPU
Tutorial¶
Abstract¶
In this tutorial, we predict the Highest Occupied Molecular Orbital (HOMO) level of the molecules in the QM9 dataset [1][2] with Neural Fingerprint (NFP) [3][4]. We concentrate on briefly explaining how to use Chainer Chemistry and do not go into the details of the NFP implementation.
Tested Environment¶
- Chainer Chemistry >= 0.0.1 (See Installation)
- Chainer >= 2.0.2
- CUDA == 8.0, CuPy >= 1.0.3 (Required only when using GPU)
- For CUDA 9.0, CuPy >= 2.0.0 is required
- sklearn >= 0.17.1 (Only for preprocessing)
QM9 Dataset¶
QM9 is a publicly available dataset of small organic molecule structures and their simulated properties, intended for data-driven research on material property prediction and chemical space exploration. It contains 133,885 stable small organic molecules made up of C, H, O, N and F atoms. The available properties are geometric, energetic, electronic, and thermodynamic.
In this tutorial, we predict the HOMO level, one of these properties. Physically, computing the HOMO level requires quantum chemical calculations. From a mathematical viewpoint, it requires solving an internal eigenvalue problem for a Hamiltonian matrix. Predicting the HOMO level accurately with a neural network is a big challenge, because the network has to approximate both the construction of the Hamiltonian matrix and the solution of the internal eigenvalue problem.
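To make the eigenvalue picture concrete, here is a toy sketch based on the Hückel model of a linear chain of atoms. This model is not part of QM9 or Chainer Chemistry; it is used here only because its Hamiltonian eigenvalues have a closed form. The function names and the parameters alpha and beta are illustrative assumptions, and reading "internal" as "in the interior of the spectrum" is an interpretation rather than something the tutorial states. The sketch shows that the HOMO level is neither the smallest nor the largest eigenvalue; it is selected by the number of electrons.

```python
import math


def huckel_levels(n_atoms, alpha=0.0, beta=-1.0):
    """Orbital energies of a linear chain of n_atoms sites, sorted ascending.

    The Hückel Hamiltonian of a linear chain is a tridiagonal Toeplitz matrix
    (alpha on the diagonal, beta next to it), whose eigenvalues are
    alpha + 2 * beta * cos(k * pi / (n_atoms + 1)) for k = 1..n_atoms.
    """
    return sorted(
        alpha + 2.0 * beta * math.cos(k * math.pi / (n_atoms + 1))
        for k in range(1, n_atoms + 1)
    )


def huckel_homo(n_atoms, alpha=0.0, beta=-1.0):
    """Energy of the highest occupied level, assuming one electron per site."""
    levels = huckel_levels(n_atoms, alpha, beta)
    n_occupied = (n_atoms + 1) // 2  # each orbital holds up to two electrons
    return levels[n_occupied - 1]


# Butadiene-like chain of four sites: the HOMO is the second of four levels.
levels = huckel_levels(4)
homo = huckel_homo(4)
print(levels)
print(homo)
```

A real quantum chemical calculation first has to build a far larger Hamiltonian from the molecular geometry, which is the part with no closed form.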
HOMO prediction by NFP¶
First, clone the library repository from GitHub.
The repository contains a Python script, examples/qm9/train_qm9.py.
It executes the whole training procedure: it downloads the QM9 dataset, preprocesses it, defines an NFP model, and runs training on it.
Execute the following commands on a machine that satisfies the requirements in Tested Environment.
~$ git clone git@github.com:pfnet-research/chainer-chemistry.git
~$ cd chainer-chemistry/examples/qm9/
Hereafter all shell commands should be executed in this directory.
If you are new to Chainer, the Chainer handson will greatly help you. In particular, the explanation of the inclusion relationship of Chainer classes in Sec. 4 of Chap. 2 is helpful when you read the sample script.
Next, the dataset preparation part and the model definition part of train_qm9.py are explained.
If you are not interested in them, skip Dataset Preparation and Model Definition, and jump to Run.
Dataset Preparation¶
Chainer Chemistry accepts the same dataset types as Chainer, such as chainer.datasets.SubDataset.
In this section we learn how to download the QM9 dataset and use it as a Chainer dataset.
The following Python script downloads the dataset and saves it in .npz format.
#!/usr/bin/env python
import os

from chainer_chemistry import datasets as D
from chainer_chemistry.dataset.preprocessors import preprocess_method_dict
from chainer_chemistry.datasets import NumpyTupleDataset

# Download QM9 and preprocess it for NFP, keeping only the HOMO label.
preprocessor = preprocess_method_dict['nfp']()
dataset = D.get_qm9(preprocessor, labels='homo')

# Cache the preprocessed dataset; exist_ok (Python 3) allows re-running.
cache_dir = 'input/nfp_homo/'
os.makedirs(cache_dir, exist_ok=True)
NumpyTupleDataset.save(cache_dir + 'data.npz', dataset)
The last two lines save the dataset to input/nfp_homo/data.npz, so we do not need to download it again next time.
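The caching idea can be sketched independently of Chainer Chemistry as a "load if present, otherwise build and save" helper. The sketch below is an illustration only: load_or_build and expensive_build are hypothetical names, and it stores JSON so that it runs with the standard library alone, whereas the tutorial uses NumpyTupleDataset.save and NumpyTupleDataset.load with an .npz file.

```python
import json
import os
import tempfile


def load_or_build(cache_path, build_fn):
    """Return the cached object if cache_path exists, else build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    data = build_fn()
    cache_dir = os.path.dirname(cache_path)
    if cache_dir:
        os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, 'w') as f:
        json.dump(data, f)
    return data


build_calls = []


def expensive_build():
    # Stand-in for downloading and preprocessing a dataset (made-up values).
    build_calls.append(1)
    return {'labels': [0.1, 0.2, 0.3]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'input', 'nfp_homo', 'data.json')
    first = load_or_build(path, expensive_build)   # builds and saves
    second = load_or_build(path, expensive_build)  # loads from the cache

print(first == second, len(build_calls))
```

The second call never invokes the build function, which is the property that saves the download on later runs.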
The following Python script reads the dataset from the saved .npz file and splits the data points into training and validation sets.
#!/usr/bin/env python
from chainer.datasets import split_dataset_random
from chainer_chemistry.datasets import NumpyTupleDataset

# Load the dataset cached by the previous script.
cache_dir = 'input/nfp_homo/'
dataset = NumpyTupleDataset.load(cache_dir + 'data.npz')

# Use 70% of the data points for training and the rest for validation.
train_data_ratio = 0.7
train_data_size = int(len(dataset) * train_data_ratio)
# The third argument (777) is the random seed, so the split is reproducible.
train, val = split_dataset_random(dataset, train_data_size, 777)
print('train dataset size:', len(train))
print('validation dataset size:', len(val))
The function split_dataset_random()
returns a tuple of two chainer.datasets.SubDataset
objects (training and validation set).
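Conceptually, a seeded random split shuffles the sample indices once and cuts the permutation at the requested size. The following standard-library sketch illustrates that idea; split_random is a hypothetical helper, not Chainer's implementation, and it will not reproduce the exact ordering that split_dataset_random produces for the same seed.

```python
import random


def split_random(n_samples, first_size, seed=None):
    """Return two disjoint index lists cut from one shuffled permutation."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return order[:first_size], order[first_size:]


n_samples = 10
train_size = int(n_samples * 0.7)
train_idx, val_idx = split_random(n_samples, train_size, seed=777)
print(len(train_idx), len(val_idx))
```

Because both parts come from the same permutation, no sample can appear in both sets, and fixing the seed makes the split repeatable across runs.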
Now that you have prepared the training and validation data points, you can construct chainer.dataset.Iterator objects, which the updaters in Chainer require.
Model Definition¶
In Chainer, a neural network model is defined as a chainer.Chain
object.
Graph convolutional networks such as NFP generally consist of graph convolution layers followed by multilayer perceptron (MLP) layers.
Therefore it is convenient to define a class that inherits chainer.Chain and is composed of two chainer.Chain objects corresponding to the two kinds of layers.
Execute the following Python script and check that you can define such a class.
NFP and MLP are chainer.Chain classes that are already defined.
#!/usr/bin/env python
import chainer
from chainer_chemistry.models import MLP, NFP


class GraphConvPredictor(chainer.Chain):

    def __init__(self, graph_conv, mlp):
        super(GraphConvPredictor, self).__init__()
        with self.init_scope():
            self.graph_conv = graph_conv
            self.mlp = mlp

    def __call__(self, atoms, adjs):
        x = self.graph_conv(atoms, adjs)
        x = self.mlp(x)
        return x


n_unit = 16
conv_layers = 4
model = GraphConvPredictor(NFP(n_unit, n_unit, conv_layers),
                           MLP(n_unit, 1))
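To illustrate what "graph convolution layers followed by an MLP" means for the data, here is a deliberately simplified pure-Python sketch. It is not the NFP algorithm and uses none of the Chainer Chemistry classes: graph_conv, readout and ToyPredictor are hypothetical names, the convolution is a plain neighbour sum without learned weights, and a single linear unit stands in for the MLP.

```python
def graph_conv(atom_feats, adj):
    """One toy convolution step: every atom adds its neighbours' features."""
    updated = []
    for i, own in enumerate(atom_feats):
        agg = list(own)
        for j, neighbour in enumerate(atom_feats):
            if adj[i][j]:
                agg = [a + b for a, b in zip(agg, neighbour)]
        updated.append(agg)
    return updated


def readout(atom_feats):
    """Sum over atoms so that a molecule of any size gives a fixed-size vector."""
    return [sum(column) for column in zip(*atom_feats)]


class ToyPredictor(object):
    """Graph convolution steps followed by a single linear output unit."""

    def __init__(self, n_layers, weights, bias):
        self.n_layers = n_layers
        self.weights = weights
        self.bias = bias

    def __call__(self, atom_feats, adj):
        h = atom_feats
        for _ in range(self.n_layers):
            h = graph_conv(h, adj)
        pooled = readout(h)
        return sum(w * v for w, v in zip(self.weights, pooled)) + self.bias


# Three-atom chain H-O-H with one-hot atom features [is_H, is_O].
atoms = [[1, 0], [0, 1], [1, 0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
model = ToyPredictor(n_layers=1, weights=[0.5, -1.0], bias=0.0)
prediction = model(atoms, adj)
print(prediction)
```

The structure mirrors GraphConvPredictor: the first stage maps per-atom inputs and the adjacency matrix to one vector per molecule, and the second stage maps that vector to a scalar prediction.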
Run¶
You have defined the dataset and the NFP model on Chainer. There are no other procedures specific to Chainer Chemistry. Hereafter you should just follow the usual procedures in Chainer to execute training.
The sample script examples/qm9/train_qm9.py
contains all the procedures and you can execute training just by invoking the script.
The following command starts training for 20 epochs and reports loss and accuracy during training.
They are reported for each of main
(dataset for training) and validation
(dataset for validation).
The --gpu 0
option is to utilize a GPU with device id = 0.
If you do not have a GPU, set --gpu -1
or just drop --gpu 0
to use CPU for all the calculation.
In most cases, calculation with GPU is much faster than that only with CPU.
~/chainer-chemistry/examples/qm9$ python train_qm9.py --method nfp --label homo --gpu 0 # If GPU is unavailable, set --gpu -1
Train NFP model...
epoch main/loss main/accuracy validation/main/loss validation/main/accuracy elapsed_time
1 0.746135 0.0336724 0.680088 0.0322597 58.4605
2 0.642823 0.0311715 0.622942 0.0307055 113.748
(...)
19 0.540646 0.0277585 0.532406 0.0276445 1052.41
20 0.537062 0.0276631 0.551695 0.0277499 1107.29
After training finishes, you will find the log file in the result/ directory.
Evaluation¶
In the loss and accuracy report, we are mainly interested in validation/main/accuracy.
Despite its name, the accuracy field is actually the mean absolute error, which is why it decreases during training.
The unit is Hartree.
Therefore the last line means that the validation mean absolute error is 0.0277499 Hartree.
See the scaled_abs_error() function in train_qm9.py for the detailed definition of the mean absolute error.
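For reference, the plain mean absolute error averages the absolute differences between predictions and targets. The sketch below uses made-up numbers and a hypothetical helper name; it does not reproduce scaled_abs_error(), whose exact definition is in the script.

```python
def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over all samples."""
    if len(predictions) != len(targets):
        raise ValueError('predictions and targets must have the same length')
    total = sum(abs(p - t) for p, t in zip(predictions, targets))
    return total / len(targets)


# Made-up HOMO-like values in Hartree, for illustration only.
predicted = [-0.25, -0.20, -0.30]
actual = [-0.24, -0.23, -0.30]
mae = mean_absolute_error(predicted, actual)
print(mae)
```

A lower value is better, which is the opposite of what the column name accuracy suggests.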
You can also train other types of models such as GGNN, SchNet or WeaveNet, and predict other target values such as LUMO, dipole moment and internal energy, just by changing the --method and --label options, respectively.
See the output of python train_qm9.py --help.
Using your own dataset¶
You can use your own dataset in Chainer Chemistry. examples/own_dataset shows an example.
Reference¶
[1] L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.
[2] R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.
[3] Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., & Adams, R. P. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems (pp. 2224-2232).
[4] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., & Dahl, G. E. (2017). Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.
Contribution guide¶
We welcome any type of contribution that helps to improve and promote Chainer Chemistry. Typical contributions include:
- Send pull requests (PRs) to the repository (we recommend that developers read the Development policy before starting to implement a PR).
- Report bugs or problems as issues.
- Send questions to developer community sites like Stack Overflow or Chainer Slack (en, jp).
- Write a blog post about Chainer Chemistry or its use case.
Development policy¶
In this section, we describe the development policy that the core developers follow. Developers who are thinking of sending PRs to the repository are encouraged to read the following sections before starting implementation.
Versioning policy¶
We basically follow Semantic Versioning v2.0.0. In Chainer Chemistry, the public APIs in the sense of semantic versioning are those described in the documentation.
In addition to the rules of semantic versioning, we follow these versioning rules during major version zero:
- We do not plan any scheduled releases.
- We do not plan any pre releases.
- We release a minor version when the core development team agrees. Typically, we do so when (1) a sufficient number of features have been added since the last minor release, (2) the latest release cannot run the example code in the master branch of the repository, or (3) critical bugs are found. But we are not restricted to these cases.
- If we find critical bugs, we should release a patch version or a minor version that fixes them. The core development team will determine which version to release.
We do not have a concrete plan about versioning strategy after v1.0.0.
Compatibility policy¶
As an immediate consequence of semantic versioning, we may break the compatibility of public APIs, including additions, deletions, and changes in their semantics, at any time during major version zero. Since the APIs of Chainer Chemistry are still immature and unstable, we expect that introducing new features can sometimes involve compatibility breaks. If we face a dilemma between the cost of backward compatibility and the benefit of new features, we are likely to give up the former, because we place importance on introducing new features as soon as possible. Of course, we care about backward compatibility whenever it is easy and low-cost.
Like ChainerCV, Chainer Chemistry provides several off-the-shelf deep learning models (e.g. Neural Fingerprint) whose papers are available on arXiv or in the proceedings of machine learning conferences. Although most published papers report evaluation results of the models on publicly available datasets, we do NOT guarantee the reproducibility of the experiments in the papers.
At some point, coding examples in the master branch of the official repository may not work even with the latest release. In that case, users are recommended to either use the example code of the latest release or update the library code to the master branch.
As of v0.3.0, we have introduced BaseForwardModel, which provides methods for serializing itself to a file and loading itself from one. As these methods internally use pickle, the portability of the class depends on that of pickling. In particular, serialized instances of BaseForwardModel made with an older Chainer Chemistry may not be loadable with a newer one, partly because we may change their internal structures for refactoring, performance improvement, and so on. See the documentation of BaseForwardModel and its subclasses (e.g. Classifier, Regressor).
Branch strategy¶
The official repository of Chainer Chemistry is https://github.com/pfnet-research/chainer-chemistry. We use the master branch of the repository for development. Therefore, developers who make PRs should send them to the master branch.
During major version zero, we do not maintain any released versions. When a bug is found, changes for the bug should be merged to the next version (either minor or patch). If the bug is critical, we will release the next version as soon as possible.
Coding guideline¶
We basically adopt PEP8 (https://www.python.org/dev/peps/pep-0008/) as a style guide. You can check your code with flake8, which you can install by:
$ pip install flake8
and run with the flake8 command.
In addition to PEP8, we use upper camel case (e.g. FooBar
) for class names and snake case (e.g. foo_bar
) for function, method, variable and package names.
Although we recommend that developers follow these rules as well, they are not mandatory.
For documents, we follow the Google Python Style Guide and compile it with Napoleon, which is an extension of Sphinx.
Testing guideline¶
Chainer Chemistry uses pytest as a unit-test framework.
All unit tests are located in the tests/ directory. You can run them with the normal usage of pytest.
For example, the following command runs all unit tests:
$ pytest tests
Some unit tests require GPUs; they are annotated with @pytest.mark.gpu.
You can therefore skip them with the -m option:
$ pytest -m "not gpu" tests
If you write a unit test that uses GPUs, you must annotate it with @pytest.mark.gpu.
Similarly, some unit tests take a long time to complete.
We annotate them with @pytest.mark.slow, and you can skip them with the -m option:
$ pytest -m "not slow" tests
Any unit test that takes a long time to complete must be annotated with @pytest.mark.slow.
We can skip both GPU and slow tests with the following command:
$ pytest -m "not (gpu or slow)" tests
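As an illustration of the annotations described above, a test module might look like the following sketch. The test names and bodies are hypothetical placeholders, not tests from the repository; only the gpu and slow marker names come from this guideline.

```python
import pytest


@pytest.mark.gpu
def test_add_on_gpu():
    # Hypothetical GPU test; a real one would move arrays to the device.
    assert 1 + 1 == 2


@pytest.mark.slow
def test_long_running():
    # Hypothetical slow test; a real one might train for several epochs.
    assert sum(range(10)) == 45


@pytest.mark.gpu
@pytest.mark.slow
def test_training_on_gpu():
    # Carries both markers, so "not gpu" and "not slow" each deselect it.
    assert True
```

With these markers in place, each -m expression shown in this section selects or skips the tests by marker name alone, without any change to the test bodies.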
Terminology¶
In the context of machine learning, especially chemoinformatics, we use several terms such as feature, feature vector, descriptor and so on to indicate the representation of inputs. To avoid ambiguity and align the naming convention within the library code, we use these terms in the following way:
- Feature is a representation of a sample of interest (typically molecules in Chainer Chemistry).
- Label is a target value that we want to predict.
- Input feature is a representation of a sample from which we want to predict the target value.
For example, consider a supervised learning task whose dataset consists of input-output pairs ((x_1, y_1), ..., (x_N, y_N)), where N is the number of samples.
In Chainer Chemistry, x_i and y_i are called the input feature and the label, respectively, and the pair (x_i, y_i) is the feature for each i.
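The terminology can be pinned down with a tiny sketch. The values are made up for illustration: each input feature is a list of atomic numbers and each label is a scalar, which is only one possible representation.

```python
# Each element is a feature: a pair (input feature, label).
dataset = [
    ([6, 8], 0.5),        # (x_1, y_1)
    ([6, 6, 8], -1.0),    # (x_2, y_2)
    ([7, 1, 1, 1], 2.0),  # (x_3, y_3)
]

n_samples = len(dataset)                   # N
input_features = [x for x, _ in dataset]   # x_1, ..., x_N
labels = [y for _, y in dataset]           # y_1, ..., y_N
print(n_samples, input_features, labels)
```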
Relation to Chainer¶
Chainer is a deep learning framework written in Python that features dynamic computational graph construction (the “define-by-run” paradigm) for flexible and intuitive model development. As the name indicates, Chainer Chemistry is an extension library of Chainer built on top of it. The core development team members of Chainer and that of Chainer Chemistry work together tightly.
API Reference¶
Dataset¶
Converters¶
chainer_chemistry.dataset.converters.concat_mols |
Concatenates a list of molecules into array(s). |
Indexers¶
chainer_chemistry.dataset.indexer.BaseIndexer |
Base class for Indexer |
chainer_chemistry.dataset.indexer.BaseFeatureIndexer |
Base class for FeatureIndexer |
chainer_chemistry.dataset.indexers.NumpyTupleDatasetFeatureIndexer |
FeatureIndexer for NumpyTupleDataset |
Parsers¶
chainer_chemistry.dataset.parsers.BaseParser |
|
chainer_chemistry.dataset.parsers.CSVFileParser |
csv file parser |
chainer_chemistry.dataset.parsers.SDFFileParser |
sdf file parser |
chainer_chemistry.dataset.parsers.DataFrameParser |
data frame parser |
chainer_chemistry.dataset.parsers.SmilesParser |
smiles parser |
Preprocessors¶
Base preprocessors¶
chainer_chemistry.dataset.preprocessors.BasePreprocessor |
Base class for preprocessor |
chainer_chemistry.dataset.preprocessors.MolPreprocessor |
preprocessor class specified for rdkit mol instance |
Concrete preprocessors¶
chainer_chemistry.dataset.preprocessors.AtomicNumberPreprocessor |
Atomic number Preprocessor |
chainer_chemistry.dataset.preprocessors.ECFPPreprocessor |
|
chainer_chemistry.dataset.preprocessors.GGNNPreprocessor |
GGNN Preprocessor |
chainer_chemistry.dataset.preprocessors.NFPPreprocessor |
NFP Preprocessor |
chainer_chemistry.dataset.preprocessors.SchNetPreprocessor |
SchNet Preprocessor |
chainer_chemistry.dataset.preprocessors.WeaveNetPreprocessor |
WeaveNet must have fixed-size atom list for now, zero_padding option |
chainer_chemistry.dataset.preprocessors.RelGATPreprocessor |
RelGAT Preprocessor |
chainer_chemistry.dataset.preprocessors.RelGCNPreprocessor |
RelGCN Preprocessor |
chainer_chemistry.dataset.preprocessors.RSGCNPreprocessor |
RSGCN Preprocessor |
Utilities¶
chainer_chemistry.dataset.preprocessors.MolFeatureExtractionError |
|
chainer_chemistry.dataset.preprocessors.type_check_num_atoms |
Check number of atoms in mol does not exceed num_max_atoms |
chainer_chemistry.dataset.preprocessors.construct_atomic_number_array |
Returns atomic numbers of the atoms constituting a molecule. |
chainer_chemistry.dataset.preprocessors.construct_adj_matrix |
Returns the adjacency matrix of the given molecule. |
Splitters¶
chainer_chemistry.dataset.splitters.RandomSplitter |
Class for doing random data splits. |
chainer_chemistry.dataset.splitters.StratifiedSplitter |
Class for doing stratified data splits. |
chainer_chemistry.dataset.splitters.ScaffoldSplitter |
Class for doing data splits by chemical scaffold. |
Datasets¶
Dataset implementations¶
chainer_chemistry.datasets.NumpyTupleDataset |
Dataset of a tuple of datasets. |
Dataset loaders¶
chainer_chemistry.datasets.tox21.get_tox21 |
Downloads, caches and preprocesses Tox21 dataset. |
chainer_chemistry.datasets.qm9.get_qm9 |
Downloads, caches and preprocesses QM9 dataset. |
chainer_chemistry.datasets.molnet.get_molnet_dataset |
Downloads, caches and preprocesses MoleculeNet dataset. |
chainer_chemistry.datasets.molnet.get_molnet_dataframe |
Downloads, caches and gets the dataframe of MoleculeNet dataset. |
Functions¶
Function implementations¶
chainer_chemistry.functions.matmul |
Computes the matrix multiplication of two arrays. |
chainer_chemistry.functions.mean_squared_error |
Mean squared error function. |
chainer_chemistry.functions.mean_absolute_error |
Mean absolute error function. |
chainer_chemistry.functions.r2_score |
Computes R^2(coefficient of determination) regression score function. |
Iterators¶
Iterator Implementations¶
chainer_chemistry.iterators.BalancedSerialIterator |
Dataset iterator that serially reads the examples with balancing label. |
chainer_chemistry.iterators.IndexIterator |
Index iterator |
Links¶
Link implementations¶
chainer_chemistry.links.EmbedAtomID (out_size) |
Embedding specialized to atoms. |
chainer_chemistry.links.GraphLinear (in_size) |
Graph Linear layer. |
chainer_chemistry.links.GraphBatchNormalization ([…]) |
Graph Batch Normalization layer. |
Scaler implementations¶
chainer_chemistry.links.StandardScaler () |
Update implementations¶
chainer_chemistry.links.GGNNUpdate ([…]) |
GGNN submodule for update part. |
chainer_chemistry.links.NFPUpdate (…[, …]) |
NFP submodule for update part. |
chainer_chemistry.links.RelGATUpdate (…[, …]) |
RelGAT submodule for update part. |
chainer_chemistry.links.RelGCNUpdate (…[, …]) |
RelGCN submodule for update part. |
chainer_chemistry.links.RSGCNUpdate (…) |
RSGCN submodule for message and update part. |
chainer_chemistry.links.SchNetUpdate ([…]) |
Readout implementations¶
chainer_chemistry.links.GeneralReadout ([…]) |
General submodule for readout part. |
chainer_chemistry.links.GGNNReadout (out_dim) |
GGNN submodule for readout part. |
chainer_chemistry.links.NFPReadout (…) |
NFP submodule for readout part. |
chainer_chemistry.links.SchNetReadout ([…]) |
SchNet submodule for readout part. |
Models¶
Model implementations¶
chainer_chemistry.models.NFP |
Neural Finger Print (NFP) |
chainer_chemistry.models.GGNN |
Gated Graph Neural Networks (GGNN) |
chainer_chemistry.models.MLP |
Basic implementation for MLP |
chainer_chemistry.models.SchNet |
See Kristof et al, SchNet: A continuous-filter convolutional neural network for modeling |
chainer_chemistry.models.WeaveNet |
WeaveNet implementation |
chainer_chemistry.models.RelGAT |
Relational Graph Attention Networks (GAT) |
chainer_chemistry.models.RelGCN |
Relational GCN (RelGCN) |
chainer_chemistry.models.RSGCN |
Renormalized Spectral Graph Convolutional Network (RSGCN) |
Wrapper models¶
chainer_chemistry.models.BaseForwardModel |
A base model which supports forward functionality. |
chainer_chemistry.models.Classifier |
A simple classifier model. |
chainer_chemistry.models.Regressor |
A simple regressor model. |
Utilities¶
Training¶
Extensions¶
chainer_chemistry.training.extensions.batch_evaluator.BatchEvaluator |
|
chainer_chemistry.training.extensions.roc_auc_evaluator.ROCAUCEvaluator |
Evaluator which calculates ROC AUC score |
chainer_chemistry.training.extensions.prc_auc_evaluator.PRCAUCEvaluator |
Evaluator which calculates PRC AUC score |