Open Drug Discovery Toolkit (ODDT): a new open-source player in the drug discovery field


The Open Drug Discovery Toolkit (ODDT) is provided as a Python library to the cheminformatics
community. We have implemented many procedures for common and more sophisticated tasks,
the most prominent of which we review in more detail below. We would also like to emphasize
that by making the code freely available through a BSD license, we encourage other
researchers and software developers to implement more modules, functions and support
of their own software.

Molecule formats

Open Drug Discovery Toolkit is designed to support as many formats as possible by
extending the use of Cinfony [13]. This common API unites different molecular toolkits, such as RDKit and OpenBabel,
and makes interacting with them more Python-like. All atom information collected from
the underlying toolkits is stored as NumPy [14] arrays, which provide both speed and flexibility.
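As an illustration of this design, per-atom data held in a NumPy structured array can be queried vectorially. This is only a sketch: the field names and dtype below are hypothetical, not ODDT's actual internal layout.

```python
import numpy as np

# Hypothetical per-atom record: coordinates plus feature flags.
# ODDT stores analogous information in NumPy arrays; this exact
# dtype is illustrative only.
atom_dtype = np.dtype([('coords', np.float64, 3),
                       ('isdonor', np.bool_),
                       ('isacceptor', np.bool_)])

atoms = np.zeros(4, dtype=atom_dtype)
atoms['coords'] = [[0.0, 0.0, 0.0],
                   [1.5, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [3.0, 3.0, 0.0]]
atoms['isdonor'] = [True, False, True, False]
atoms['isacceptor'] = [False, True, False, True]

# Vectorized query: donor atoms within 2.5 A of the origin,
# with no per-atom Python loop.
dist = np.linalg.norm(atoms['coords'], axis=1)
close_donors = atoms[(dist < 2.5) & atoms['isdonor']]
print(len(close_donors))  # 2
```

Keeping all atoms in one array is what makes the boolean-mask selection above a single vectorized operation rather than a loop over atom objects.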

Interactions

The toolkit implements the most popular protein-ligand interactions. Directional interactions,
such as hydrogen bonds and salt bridges, have additional strict or crude terms that
indicate whether the angle parameters are within cutoffs (strict) or only certain
distance criteria are met (crude). The complete list of interactions implemented in
ODDT consists of hydrogen bonds, salt bridges, hydrophobic contacts, halogen bonds,
pi-stacking (face-to-face and edge-to-face), pi-cation, pi-metal and metal coordination.
These interactions are detected using in-house functions and procedures utilizing
NumPy vectorization for increased performance. Calculated interactions can be used
as further (re)scoring terms. Molecular features (e.g., H-acceptors and aromatic rings)
are stored as a uniform structure, which enables easy development of custom binding
queries.
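A minimal sketch of how the crude versus strict distinction can be vectorized with NumPy follows; the 3.5 Å distance and 30° angle cutoffs are typical illustrative values, not necessarily ODDT's exact parameters, and the coordinates are fabricated.

```python
import numpy as np

# Two candidate donor/acceptor pairs (fabricated coordinates).
donors = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])    # donor heavy atoms
donor_h = np.array([[1.0, 0.0, 0.0], [5.0, 1.0, 0.0]])   # attached hydrogens
acceptors = np.array([[2.5, 0.0, 0.0], [5.0, 9.0, 0.0]]) # acceptor atoms

# Crude criterion: donor-acceptor distance below a cutoff.
d = np.linalg.norm(donors - acceptors, axis=1)
crude = d < 3.5

# Strict criterion: crude AND a near-linear D-H...A geometry.
v1 = donor_h - donors      # D -> H
v2 = acceptors - donor_h   # H -> A
cosang = np.sum(v1 * v2, axis=1) / (
    np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1))
angle = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
strict = crude & (angle < 30.0)  # deviation from linearity

print(crude.tolist(), strict.tolist())  # [True, False] [True, False]
```

Both masks are computed for all pairs at once, which is the pattern that makes NumPy-based interaction detection fast.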

Filtering

Filtering small molecules by properties is implemented in ODDT. Users can apply predefined
filters such as RO5 [15], RO3 [16] and PAINS [17]. It is also possible to apply project-specific criteria for molecular weight, logP and other parameters
listed in the toolkit documentation. See Example 1 in the “Results and discussion”
section for more details on how to use filtering.
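As a standalone sketch of the idea (separate from Example 1), a Lipinski rule-of-five filter can be expressed over precomputed descriptors; ODDT computes such descriptors from molecule objects, whereas here plain dictionaries with fabricated values stand in.

```python
# Minimal rule-of-five (RO5) filter sketch over precomputed descriptors.
def passes_ro5(mol, max_violations=0):
    violations = sum([
        mol['molwt'] > 500,   # molecular weight
        mol['logp'] > 5,      # lipophilicity
        mol['hbd'] > 5,       # hydrogen-bond donors
        mol['hba'] > 10,      # hydrogen-bond acceptors
    ])
    return violations <= max_violations

# Fabricated example ligands.
ligands = [
    {'name': 'drug-like', 'molwt': 180.2, 'logp': 1.2, 'hbd': 1, 'hba': 4},
    {'name': 'oversized', 'molwt': 720.0, 'logp': 7.8, 'hbd': 2, 'hba': 6},
]
passed = [m['name'] for m in ligands if passes_ro5(m)]
print(passed)  # ['drug-like']
```

Project-specific criteria amount to swapping in different thresholds or additional boolean terms in the same structure.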

Docking

Merging free/open source docking programs into a pipeline can be a frustrating experience
for many reasons. Some programs, like Autodock [18] and Autodock Vina [19], do not support multiple ligand inputs, while other programs write scores to
separate files (e.g., GOLD [20]) or even print them directly to the console. Additional effort is required to re-score
output ligand-receptor conformations in other software. Every in-silico discovery
project is flooded with custom procedures and scripts to share data between programs.
The docking stack within ODDT provides an easier path with the use of a common docking
API. This API allows retrieving output conformations and their scores from various
widely-used docking programs. The docking stack also supports multi-threaded virtual
screening independently of the underlying software, helping to utilize all available
computational resources.
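The idea behind a common docking API can be sketched as follows: each engine-specific wrapper normalizes its native output into (conformation, score) pairs, so the pipeline code stays engine-agnostic. The class and method names here are hypothetical stand-ins, not ODDT's actual interface.

```python
# Hypothetical common docking API: every engine wrapper returns
# normalized (pose, score) pairs, regardless of how the underlying
# program reports results (files, console output, etc.).
class DockingEngine:
    def dock(self, ligand):
        raise NotImplementedError

class FakeVinaLike(DockingEngine):
    """Stand-in engine returning scored poses, as a real wrapper
    would after parsing the program's native output."""
    def dock(self, ligand):
        return [(ligand + '_pose1', -9.1), (ligand + '_pose2', -7.4)]

def screen(engine, ligands):
    # Engine-agnostic loop: keep the best (lowest) score per ligand.
    results = {}
    for lig in ligands:
        poses = engine.dock(lig)
        results[lig] = min(score for _, score in poses)
    return results

best = screen(FakeVinaLike(), ['lig_a', 'lig_b'])
print(best)  # {'lig_a': -9.1, 'lig_b': -9.1}
```

Because `screen` only sees the normalized pairs, swapping one engine wrapper for another leaves the screening loop untouched, which is the point of the common API.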

Scoring

Open Drug Discovery Toolkit provides a Python re-implementation of two machine learning-based
scoring functions: NNScore (version 2) and RFScore. RFScore was trained on the sets from its original publication
[9]. For NNScore, neither the training set nor the training procedure was made available
by the authors, other than a brief description [8]. To support NNScore, we used ffnet [21]. The training procedure for NNScore was reimplemented in ODDT and should closely
reproduce the resulting ensemble of neural networks. The training data are stored
as csv files, which are used to train scoring functions locally. After the initial
training procedure, the scoring function objects are stored in pickle files for improved
performance.
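The caching step can be sketched with the standard-library pickle module; the "scoring function" below is a trivial stand-in for a real trained model, used only to show the persist-and-reload pattern.

```python
import pickle

# Toy stand-in for a trained scoring function object.
class MeanScore:
    def fit(self, scores):
        self.mean = sum(scores) / len(scores)
        return self
    def predict(self):
        return self.mean

model = MeanScore().fit([6.1, 7.3, 5.6])

# After the initial training, the object is serialized; later
# sessions load it instead of repeating the training procedure.
blob = pickle.dumps(model)     # in ODDT this goes to a pickle file
restored = pickle.loads(blob)
print(restored.predict() == model.predict())  # True
```

Loading a pickled model is much cheaper than retraining, which is why the serialized object is the artifact used at scoring time.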

Machine learning scoring functions consist of four main building blocks: descriptors,
model, training set and test set. ODDT provides a workflow for training new models,
with additional support for custom descriptors and custom training and test sets.
Such a design allows not only the use of the toolkit to reproduce scores (or reimplement
scoring functions) but also enables the researcher to develop their own custom scoring
procedures. Finally, if random seeds are defined, the scoring function results in
ODDT are fully reproducible.
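The four building blocks and the role of the random seed can be sketched with deliberately trivial stand-ins; the descriptor, model, and data below are all hypothetical, and only the workflow shape and seed-controlled reproducibility mirror the design described above.

```python
import random

# Block 1 - descriptors: a toy featurizer (counts of two atom symbols).
def descriptor(smiles):
    return [smiles.count('C'), smiles.count('N')]

# Block 2 - model: toy "training" via seeded random-restart search.
class ToyModel:
    def __init__(self, seed=42):
        self.rng = random.Random(seed)  # fixed seed => reproducible
    def fit(self, X, y):
        best = None
        for _ in range(200):
            w = [self.rng.uniform(-1, 1) for _ in X[0]]
            err = sum((sum(wi * xi for wi, xi in zip(w, x)) - yi) ** 2
                      for x, yi in zip(X, y))
            if best is None or err < best[0]:
                best = (err, w)
        self.w = best[1]
        return self

# Block 3 - training set (fabricated SMILES/activity pairs).
train_set = [('CCN', 1.0), ('CCC', 0.0), ('CNN', 2.0)]
X = [descriptor(s) for s, _ in train_set]
y = [a for _, a in train_set]

# Same seed, same data => identical trained model.
w1 = ToyModel(seed=42).fit(X, y).w
w2 = ToyModel(seed=42).fit(X, y).w
print(w1 == w2)  # True
```

A held-out test set (block 4) would be featurized with the same `descriptor` and scored with the fitted weights; the key point is that fixing the seed makes every step of this pipeline repeatable.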

The ability to assess the predictive performance of a scoring function (or scoring procedure)
is of utmost importance. ODDT provides various ways to accomplish this task. One
approach uses the area under the receiver operating characteristic curve (ROC AUC
and semi-log ROC AUC) and the enrichment factor (EF) at a defined percentage. These
methods can be applied to every scoring function (and combinations thereof) when training/test
sets or active/inactive sets are supplied. Two other methods to test scoring function
performance are internal k-fold and leave-one-out/leave-p-out (LOO/LPO) cross-validation, both of which are particularly useful for detecting model
overfitting. These methods are available in ODDT through the sklearn Python package
[22].
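Of these metrics, the enrichment factor is the least standardized, so a self-contained sketch may help: EF at a given percentage is the active rate in the top-scored fraction of the ranked list divided by the active rate overall. The scores and labels below are fabricated.

```python
# Enrichment factor at a given top fraction of the score-ranked list.
def enrichment_factor(scores, is_active, top_fraction=0.1):
    ranked = sorted(zip(scores, is_active), key=lambda p: p[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * top_fraction)))
    actives_top = sum(a for _, a in ranked[:n_top])
    actives_all = sum(is_active)
    # (active rate in top fraction) / (active rate overall)
    return (actives_top / n_top) / (actives_all / len(ranked))

# Fabricated screening results: 10 compounds, 3 actives.
scores = [9.1, 8.7, 8.5, 7.9, 6.0, 5.5, 4.2, 3.3, 2.1, 1.0]
labels = [1,   1,   0,   0,   0,   1,   0,   0,   0,   0]
ef = enrichment_factor(scores, labels, top_fraction=0.2)
print(ef)  # ~3.33: the top 20% is about 3.3x enriched in actives
```

An EF of 1.0 would mean the scoring function ranks no better than random selection, so values well above 1 in the top fraction are what a useful function should produce.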

Statistical methods

Modeling the relationship between chemical structural descriptors and compound activities
provides insight into SAR. Ultimately, such models may predict screening outcomes
of novel compounds, guiding future discovery steps. Because some screening data are
linear by their nature, simple regressors can be applied to find correlations (e.g.,
comparative molecular field analysis, CoMFA [23]). We implemented two straightforward regressions that are widely used in cheminformatics,
both in ligand and structure-based methods: multiple linear regression and partial
least squares regression.
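A self-contained sketch of multiple linear regression via ordinary least squares follows; ODDT relies on established numerical backends for this, and the descriptor and activity values below are fabricated so the fit is exact by construction.

```python
import numpy as np

# Two fabricated descriptors per compound, four compounds.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
# Activities generated as y = 3*x1 + 2*x2 + 1, so the regression
# should recover these coefficients exactly.
y = np.array([4.0, 3.0, 6.0, 9.0])

# Append an intercept column and solve the least-squares problem.
Xb = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
pred = Xb @ coef
print(np.round(coef, 2))  # [3. 2. 1.]
print(np.round(pred, 2))  # [4. 3. 6. 9.] - the fit reproduces y
```

Partial least squares follows the same fit/predict pattern but projects the descriptors onto latent components first, which matters when descriptors are numerous or collinear.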

Nonlinear, more complex data are better assessed by machine learning models. Two forms
of machine learning models are particularly important in drug discovery: (1) regressors
for continuous data, such as IC50 values or inhibition rates, and (2) classifiers
applied to multiple bit-wise features or to ligands tagged as active/inactive (e.g., NNScore 1.0). ODDT employs sklearn as the main machine
learning backend because of its mature API and good performance. When
neural networks are required, ODDT mimics the sklearn API but uses ffnet [21] instead. The current version of our toolkit provides machine learning models that are widely
used in cheminformatics and drug discovery: (1) random forests, (2) support vector
machines, and (3) artificial neural networks (single and multilayer). These models
have been shown to provide great guidance when assessing protein-ligand complexes
in the development and application of various scoring functions [8-10] and in SAR and QSAR (e.g., [24, 25]).
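The sklearn-style estimator interface that these models share can be sketched with a trivial classifier; the nearest-mean model below is a hypothetical stand-in chosen so the example needs no external packages, not one of the models the toolkit actually ships.

```python
# Sketch of the sklearn fit/predict convention on a toy classifier
# operating on a single numeric feature per compound.
class NearestMeanClassifier:
    def fit(self, X, y):
        # Store the per-class mean of the feature.
        groups = {}
        for xi, yi in zip(X, y):
            groups.setdefault(yi, []).append(xi)
        self.means = {c: sum(v) / len(v) for c, v in groups.items()}
        return self
    def predict(self, X):
        # Assign each sample to the class with the closest mean.
        return [min(self.means, key=lambda c: abs(self.means[c] - xi))
                for xi in X]

# Fabricated data: actives (1) score high, inactives (0) score low.
clf = NearestMeanClassifier().fit([0.1, 0.2, 0.9, 1.0], [0, 0, 1, 1])
print(clf.predict([0.15, 0.95]))  # [0, 1]
```

Because every model exposes the same `fit`/`predict` pair, a pipeline written against this interface can swap a random forest for an SVM or a neural network without any other changes.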