BgN-Score and BsN-Score: Bagging and boosting based ensemble neural networks scoring functions for accurate binding affinity prediction of protein-ligand complexes

Protein-ligand complex database

We used the same complex database that Cheng et al. used as a benchmark in their recent
comparative assessment of sixteen popular SFs [14]. They obtained a refined set containing high-quality 3D structures of 1300 protein-ligand complexes from the 2007
version of PDBbind [9]. From this set, the curators of PDBbind built a test set that encompasses 65 different
protein families, each of which binds to three different ligands to form a set of
195 unique protein-ligand complexes. This is called the core set and is mainly intended to be used for benchmarking docking and scoring systems. In
order to be consistent with the comparative framework used to assess SFs in [14], we too consider the 2007 version of PDBbind. We use the core set as a test set in
this work and denote it by Cr. A primary training set, denoted by Pr, was formed by removing all Cr complexes from the total 1300 complexes in the refined set of PDBbind. As a result,
Pr contains 1105 complexes that are completely disjoint from Cr complexes.
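For concreteness, the following minimal R sketch reproduces this split; the data frames refined and core and the identifier column pdb_code are hypothetical placeholders for however the PDBbind 2007 tables are loaded.

# Hypothetical data frames: `refined` (1300 complexes) and `core` (195 complexes),
# each keyed by a PDB-code column. Cr is the core set; Pr is the refined set with
# all Cr complexes removed.
cr <- core
pr <- refined[!refined$pdb_code %in% cr$pdb_code, ]
stopifnot(nrow(pr) == 1105)   # 1300 - 195 = 1105 disjoint training complexes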

Protein-ligand complex characterization

For each protein-ligand complex, we extracted physicochemical features used in the
empirical SFs X-Score [12] (a set of 6 features denoted by X) and AffiScore [31] (a set of 30 features denoted by A), physicochemical features calculated by GOLD [32] (a set of 14 features denoted by G), and geometrical features used in the ML SF RF-Score [26] (a 36-feature set denoted by R). The software packages that calculate X-Score, AffiScore (from SLIDE), and RF-Score
features were available to us in open-source form from their authors, and a full
list of these features is provided in the appendix of [17]. The GOLD docking suite provides a utility that calculates a set of general descriptors
for both molecules as separate entities and in a complex form. The full set of these
features can be easily accessed and calculated via the Descriptors menu in GOLD. By considering all fifteen combinations of these four types of features
(i.e., X, A, R, G, X∪A, X∪R, X∪G, A∪R, A∪G, R∪G, X∪A∪R, X∪A∪G, X∪R∪G, A∪R∪G, and X∪A∪R∪G), we generated 15 versions of the Pr and Cr data sets, which we distinguish by using appropriate subscripts identifying the features
used. For instance, PrXR denotes the version of Pr comprising the set of features X∪R (referred to simply as XR) and experimentally-determined binding affinity data for
complexes in the Pr dataset.
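The enumeration of these feature-set combinations is mechanical; a short R sketch (with a hypothetical list feature_blocks mapping each tool to its feature columns) could generate all fifteen versions as follows.

# Enumerate the fifteen nonempty unions of the four feature blocks X, A, R, G.
blocks <- c("X", "A", "R", "G")
combos <- unlist(lapply(1:4, function(k) combn(blocks, k, paste, collapse = "")))
print(combos)   # "X" "A" "R" "G" "XA" "XR" ... "XARG" (15 combinations)

# A version such as PrXR would then be assembled by column-binding the X and R
# feature blocks with the measured affinities, e.g. (hypothetical names):
# pr_xr <- cbind(feature_blocks$X, feature_blocks$R, affinity = pr$affinity)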

Artificial neural networks

Artificial Neural Networks (ANNs), computational methodologies inspired by networks of biological neurons, are employed in this work. Neural networks (NNs) have been
applied in several drug-design applications for both regression and classification
problems [18,19,21]. Our ensemble approaches are based on feed-forward back-propagation (FFBP) neural
networks implemented in the R package nnet [33]. The neural networks we fit using nnet are composed of an input layer that contains neurons corresponding to features extracted
for complexes, an arbitrary number of neurons (20 in our experiments) in the hidden
layer, and an output neuron for the output layer. These neurons are interconnected
via weighted links as shown in Figure 1. The outputs of the input neurons are directed to all the neurons in the hidden layer.
The outputs of the hidden layer neurons are also directed forward to the output neuron.
The output of a network is calculated at its output neuron according to the formula:

Figure 1. Multi-layered perceptron, feed-forward neural network used to predict the binding
affinity of a protein-ligand complex characterized by a set of features P. This model represents SNN-Score, the single neural network scoring function we build.

\hat{y} = f(x_P) = O\left( \sum_{h=0}^{H} w_{h,o} \, S\left( \sum_{i=0}^{|P|} w_{i,h} \, x_i \right) \right) \qquad (1)

where x_P is a feature vector representing a protein-ligand complex characterized by a feature
set P, f(x_P) is the function that maps it to binding affinity ŷ, O is the activation function of the output neuron (linear in our case and defined simply
as O(u) = u), H + 1 is the total number of hidden neurons, S is the activation function for the hidden-layer neurons (logistic sigmoid in this
work and expressed as S(u) = e^u/(1 + e^u)), w_{h,o} refers to the weights associated with the links connecting the hidden layer to the output
layer, w_{i,h} represents the weights of input-to-hidden-layer links, and x_i is the i-th feature characterizing the protein-ligand complex. It should be noted that the weight
variables w_{0,h} in the w_{i,h} set of weights serve as bias parameters; they are associated with an internal input
variable x_0 whose value is always fixed at one (x_0 = 1). We followed the same approach to absorb the bias parameter w_{0,o} into the hidden-layer set of weights w_{h,o} by making the sigmoid function in Equation 1 output the value one (S(·) = 1) when h = 0. This topology is shown in Figure 1, where the value 1 is fed directly to the top neurons of the network’s input and hidden
layers. The network weights w_{i,h} and w_{h,o} are optimized such that they minimize the fitting criterion E defined as:

E = \sum_{n=1}^{N} \left( y_n - \hat{y}_n \right)^2 + \lambda \sum_{w} w^2 \qquad (2)

where N is the number of protein-ligand complexes in the training data, y_n and ŷ_n are the measured and predicted binding affinities of the n-th complex, respectively, and λ is a regularization parameter (the second sum runs over all network weights w). The parameter λ is also known as the weight decay and guards against weights converging to large
values. Introducing the weight decay parameter avoids saturation at
the output of the hidden-layer neurons. We scaled the input features to the range
[0, 1] to effectively optimize the weights when regularization is considered. The
accuracy of the network is maximized by performing thousands of randomized training
rounds (3000 epochs) while imposing the regularization constraint on the weights.
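As an illustration of this setup, a single network of the kind just described could be fitted with nnet roughly as follows; x_train and y_train are hypothetical stand-ins for a featurized, affinity-labeled version of Pr, and the settings mirror those reported later in this section.

library(nnet)

# Scale each feature to [0, 1] so that weight decay penalizes all weights on a
# comparable scale (constant columns would need special handling in practice).
x_scaled <- apply(x_train, 2, function(v) (v - min(v)) / (max(v) - min(v)))

fit <- nnet(x_scaled, y_train,
            size    = 20,     # hidden-layer neurons with logistic-sigmoid activation S
            linout  = TRUE,   # linear output activation, O(u) = u
            decay   = 0.005,  # weight-decay (regularization) parameter lambda in Equation 2
            maxit   = 3000,   # up to 3000 training iterations (epochs)
            rang    = 0.7,    # initial weights drawn uniformly from [-0.7, 0.7]
            abstol  = 1e-4,   # stop early if the fitting criterion drops below this
            MaxNWts = 10000)  # allow enough weights for the larger feature sets

y_hat <- predict(fit, x_scaled)   # predicted binding affinities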

Limitations of ANN models and our approach to tackling them

Although multi-layer ANN models can theoretically approximate any nonlinear continuous
function, their application in drug-discovery related problems has always been complicated
by several challenges. Bioinformatics and cheminformatics data are typically high-dimensional.
Since ANN models cannot handle large numbers of features efficiently, a pre-processing
step prior to fitting the data using an ANN model is usually necessary. Feature subset
selection using evolutionary algorithms or dimension reduction using, say, principal
component analysis (PCA), is commonly applied to overcome this problem. However, valuable
experimental information may be discarded when only a small subset of features is
selected to build a prediction model. The dimensionality-reduction approach is also
complicated by the fact that the underlying data distribution is unknown and hence
making the right choice of which dimensionality-reduction technique to apply is a
tricky problem in itself. In addition to these pre-processing issues, training ANN
models is also a challenging task because their weights cannot be guaranteed to converge
to optimal values. This causes NN models to suffer from high-variance errors, which
translate to unreliable SFs.

The aforementioned problems can be avoided and state-of-the-art performance can be
achieved by combining predictions of hundreds of diverse and nonlinear NN models.
We propose here ensemble methods based on ANNs. The ensemble itself is trained on
all the features, although each network in the ensemble is fitted to only a subset
of the features. This approach relieves us from carrying out feature subset selection
or dimensionality reduction prior to training. In fact, the performance of the ensemble
can even be improved by describing the data with more relevant features. Moreover,
it is no longer critical to tune the weights of each network in the ensemble to optimal
values as is the case for a single NN model. Suboptimal weight tuning of individual
networks could contribute to decreasing the inter-correlation between them, which
translates to a diverse ensemble and therefore a more accurate model [25].

Our proposed NN ensemble models are inspired by the Random Forests [25] and Boosted Regression Trees [34] techniques in the formation of the ensembles. So far, the focus in ensemble learning
has been more or less biased toward using decision trees as base learners in forming
ensembles. Choosing trees as base learners is mainly due to their high flexibility
and variance (low stability). High variance decreases inter-correlation between trees
and therefore increases the overall ensemble model’s accuracy. Instead of using decision
trees as base learners, we employ here multi-layered perceptron (MLP) ANNs. ANNs share
several characteristics with prediction trees: they are nonparametric, nonlinear,
and have high variance. Moreover, both techniques are very fast in prediction. ANNs
such as MLPs, however, are able to model any arbitrary boundary smoothly,
while decision trees can only learn rectangular-shaped boundaries. Decision trees
are typically pruned after training to avoid overfitting, whereas ANNs use regularization
while the network weights are optimized during learning. We next describe our two
new ensemble NN models.

BgN-Score: ensemble neural networks through bagging

Bootstrap aggregation, or bagging for short, is a popular approach to construct an
ensemble learning model. As the name implies and as indicated in the third step of
Algorithm 1, the ensemble is composed of neural networks that are fitted to bootstrap
samples from the training data. To further increase the diversity of the ensemble
and decrease its training time, the inputs to each network l are a random subset p_l of the full feature set P extracted for every protein-ligand complex (see Step 4). Feature sampling
has proven effective in building tree-based ensemble algorithms such as Random Forests
[25]. When the task is to predict the binding affinity of a new protein-ligand complex,
the output is the average of the predictions of the individual networks that compose the ensemble, as shown in Algorithm 1 and depicted in Figure 2. This mechanism can be formally expressed as:

Figure 2. BgN-Score: ensemble neural network SF using bagging approach.

\hat{y} = f(x_P) = \frac{1}{L} \sum_{l=1}^{L} \hat{f}_l(x_{p_l}) \qquad (3)

where x_P is a feature vector representing a protein-ligand complex characterized by a feature
set P, f(x_P) is the function that maps it to binding affinity ŷ, x_{p_l} is the same complex but characterized by a random subset p_l of features (|p_l| ≪ |P|), L is the number of networks in the ensemble, and f̂_l(x_{p_l}) is the prediction of each network l in the ensemble, which is calculated at the output neuron according to Equation 1.
The final bagging-based ensemble SF is referred to as BgN-Score.
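A compact R sketch of this procedure (under the same hypothetical x_train/y_train as in the earlier sketch, with the parameter values reported later) is given below; Algorithm 1, listed later, states it precisely.

library(nnet)

L <- 3000        # number of networks in the ensemble
p_size <- 10     # size |p_l| of the random feature subset per network
N <- nrow(x_train)

ensemble <- lapply(1:L, function(l) {
  boot  <- sample(N, N, replace = TRUE)       # bootstrap sample of complexes
  feats <- sample(ncol(x_train), p_size)      # random feature subset p_l
  net <- nnet(x_train[boot, feats, drop = FALSE], y_train[boot],
              size = 20, linout = TRUE, decay = 0.005,
              maxit = 3000, trace = FALSE)
  list(net = net, feats = feats, boot = boot)  # keep boot for OOB validation later
})

# Prediction for new complexes: average the L network outputs (Equation 3).
predict_bgn <- function(x_new) {
  preds <- sapply(ensemble, function(m)
    predict(m$net, x_new[, m$feats, drop = FALSE]))
  rowMeans(preds)
}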

BsN-Score: ensemble neural networks through boosting

Boosting is an ensemble machine-learning technique based on a stage-wise fitting of
base learners. The technique attempts to minimize the overall loss by boosting the
complexes with the highest prediction errors, i.e., by fitting NNs to the (accumulated) residuals
made by previous networks in the ensemble model. There are several different implementations
of the boosting concept in the literature. The differences mainly arise from the employed
loss functions and treatment of most erroneous predictions. Our proposed NN boosting
algorithm in this work is a modified version of the boosting strategy developed by
Cao et al. [35] and Friedman [34] in that we perform random feature subset sampling. This approach builds a stage-wise
model as listed in Algorithm 2 and shown in Figure 3. The algorithm starts by fitting the first network to all training complexes. A small
fraction (ν ≪ 1) of the first network’s predictions is used to calculate the first iteration of
residuals as shown in Step 3 of Algorithm 2. Step 3 also shows that the network f_1 is the first term in the boosting additive model. In each subsequent stage l, a network is trained on a bootstrap sample of the training complexes described by
a random subset p_l of features (Steps 5 and 6). The values of the dependent variable of the training
data for the network l are the current residuals corresponding to the sampled protein-ligand complexes. The
residuals for a network at each stage are the differences between previous stage residuals
and a small fraction of its predictions. This fraction is controlled by the shrinkage
parameter ν ≪ 1 to avoid overfitting. Network generation continues as long as the number of
networks does not exceed a predefined limit L. Each network joins the ensemble with a shrunk version of itself. In our experiments,
we fixed the shrinkage parameter to 0.001, which gave the lowest out-of-sample error.
We refer to this boosting-based ensemble SF as BsN-Score.
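The stage-wise construction can be sketched in R as follows (again with the hypothetical x_train/y_train and the parameter values reported later); Algorithm 2, listed later, gives the precise statement.

library(nnet)

L <- 3000; p_size <- 10; nu <- 0.001   # nu: shrinkage parameter (<< 1)
N <- nrow(x_train)
res <- y_train                          # running residuals, initialized to the measured affinities
models <- vector("list", L)

for (l in 1:L) {
  boot  <- if (l == 1) 1:N else sample(N, N, replace = TRUE)  # first network sees all complexes
  feats <- sample(ncol(x_train), p_size)                      # random feature subset p_l
  net <- nnet(x_train[boot, feats, drop = FALSE], res[boot],
              size = 20, linout = TRUE, decay = 0.005,
              maxit = 3000, trace = FALSE)
  # Shrink this network's predictions on ALL training complexes and update the
  # running residuals (Steps 10-11 of Algorithm 2).
  res <- res - nu * as.vector(predict(net, x_train[, feats, drop = FALSE]))
  models[[l]] <- list(net = net, feats = feats)
}

# Prediction: the shrunk sum of the stage-wise networks.
predict_bsn <- function(x_new) {
  Reduce(`+`, lapply(models, function(m)
    as.vector(nu * predict(m$net, x_new[, m$feats, drop = FALSE]))))
}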

Figure 3. BsN-Score: ensemble neural network SF using boosting approach.

Neural networks and Random Forests scoring functions

In order to investigate the effectiveness of ensemble NN SFs in comparison to traditional
NN models and ensemble decision-tree models, we trained and tested BgN-Score, BsN-Score,
a single neural network SF referred to as SNN-Score, and a Random Forests (RF) SF on the Pr and Cr datasets, respectively, characterized by all fifteen combinations of the X, A, R,
and G features discussed above. For a fair comparison of their potential, the parameters
of these SFs were tuned in a consistent manner to optimize the mean-squared prediction
errors on validation complexes sampled without replacement from the training set and
independent of the test data. Out-of-bag instances were used as validation complexes
for BgN-Score and RF, while a ten-fold cross-validation was conducted for BsN-Score
and SNN-Score SFs. Out-of-bag (OOB) refers to complexes that are not sampled from
the training set when bootstrap sets are drawn to fit individual NNs in BgN-Score
models or decision trees in RF; on average, roughly a third (about 37%) of the training-set complexes
are left out (“out-of-bag”) of each bootstrap sample. The parameters that
are tuned and their optimized values are as follows. (1) L: the number of base learners (neural networks in ensemble NN SFs and decision trees
in RF) was set to 3000. (2) |p|: the size of the feature subset p randomly selected from the overall set of features P while constructing each neural network in ensemble NN SFs or the size of the randomly-selected
feature subset used at each node of a decision tree to perform a binary split on the
“best” feature in the RF SF. This was set to 10 for ensemble SFs, except when ensemble SFs are fitted to the 6 X-Score features, in which case it was set to 3. The number
of input neurons for SNN-Score is set to one more than the number of features used
to characterize training and test complexes. All NN SFs have one output neuron per
network that produces the binding affinity score. (3) H + 1: the number of hidden-layer neurons in NN SFs was set to 20. (4) A total of 3000
training epochs and a decay value (λ) of 0.005 were used to optimize the weights for each network in the ensemble and
single NN SFs. The training process of a network is terminated earlier if the fitting
criterion defined in Equation 2 falls below 0.0001 before the maximum number of training
epochs is completed. This threshold is the default value for the abstol parameter in the nnet package that we use. (5) ν: the shrinkage parameter for BsN-Score models was set to 0.001. (6) The weights of
each network were randomly initialized in the range [-0.7,0.7]. The bounds of this
uniform distribution are the default values for the rang parameter in the nnet package.
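To make the OOB validation concrete, the following sketch computes OOB predictions and the validation mean-squared error for the bagged ensemble built in the earlier BgN-Score sketch (whose elements store their bootstrap indices in boot).

# For each training complex, average only the networks whose bootstrap sample
# did NOT contain it, then score the resulting OOB predictions.
oob_sum   <- numeric(N)
oob_count <- numeric(N)

for (m in ensemble) {
  oob  <- setdiff(seq_len(N), m$boot)   # complexes this network never saw
  pred <- as.vector(predict(m$net, x_train[oob, m$feats, drop = FALSE]))
  oob_sum[oob]   <- oob_sum[oob] + pred
  oob_count[oob] <- oob_count[oob] + 1
}

oob_mse <- mean((y_train - oob_sum / oob_count)^2)   # validation criterion used for tuning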

Algorithm 1 Algorithm for building BgN-Score: an ensemble NN SF using bagging

1: Input: training data D = {X_P, Y}, where X_P = {x_P^1, …, x_P^N}, Y = {y_1, …, y_N}, and N is the number of training complexes.

2: for l = 1 to L do

3:   Draw a bootstrap sample X_P^l from X_P.

4:   Describe the complexes in the bootstrap sample using a random subset p_l of features: X_{p_l}^l.

5:   From Y, draw the measured binding affinities of the complexes in the sample X_{p_l}^l: Y^l.

6:   Construct a new training set: D_l = {X_{p_l}^l, Y^l}.

7:   Learn the current binding affinities by training an FFBP NN model f_l on D_l.

8: end for

9: The final prediction of a protein-ligand complex x_P is: f̂(x_P) = (1/L) · Σ_{l=1}^{L} f̂_l(x_{p_l}).

Algorithm 2 Algorithm for building BsN-Score: an ensemble NN SF using boosting

1: Input: training data D = {X_P, Y}, where X_P = {x_P^1, …, x_P^N}, Y = {y_1, …, y_N}, and N is the number of training complexes.

2: Construct X_{p_1} from X_P by selecting a random subset p_1 of features.

3: Train an FFBP NN model f_1 on D_1 = {X_{p_1}, Y} and use it to predict the binding affinities Ŷ_1 of the complexes X_{p_1}. Calculate the residuals: R_1 = Y − ν·Ŷ_1.

4: for l = 2 to L do

5:   Draw a bootstrap sample X_P^l from X_P.

6:   Describe the complexes in the bootstrap sample using a random subset p_l of features: X_{p_l}^l.

7:   From R_{l−1}, draw the residuals corresponding to the complexes in the sample X_{p_l}^l: R_{l−1}^l.

8:   Construct a new training set: D_l = {X_{p_l}^l, R_{l−1}^l}.

9:   Learn the current residuals by training an FFBP NN model f_l on D_l.

10:   Calculate the predictions Ŷ_l of the NN model f_l on all training complexes in the original training set D.

11:   Update the residuals: R_l = R_{l−1} − ν·Ŷ_l.

12: end for

13: The final prediction of a protein-ligand complex x_P is: f̂(x_P) = ν · Σ_{l=1}^{L} f̂_l(x_{p_l}).

We distinguish the various NN models we built from each other using the notation NN model::tools used to calculate features. For instance, BsN-Score::XA implies that the SF is a boosted ensemble neural-network
model that is trained and tested on complex sets described by XA features. For brevity, for each of SNN-Score, BgN-Score, BsN-Score, and RF models,
we report results only for the feature combination (out of the fifteen possible) that
yields the best performance on the validation complexes sampled without replacement
from the training data and independent of the test set.

Scoring functions under comparative assessment

We compare the scoring performance of our proposed NN models to that of sixteen
scoring functions used in popular molecular docking software. The scoring accuracies
of these sixteen SFs were computed by Cheng et al. in a recent study on the same benchmark
we consider. The functions are listed in Table 1, which includes five SFs implemented in Discovery Studio, five SFs in SYBYL, three
SFs in GOLD, one in Glide, and two standalone SFs. Nine of these SFs are empirical,
four are knowledge-based, and the remaining three are based on force fields.

Table 1. The 16 conventional scoring functions and the molecular docking software in which
they are implemented

Some of the scoring functions have several options or versions; these include DrugScore,
LigScore, LUDI, PLP, and X-Score. For conciseness, we only select the version that
has the highest scoring accuracy on the PDBbind benchmark that was considered by Cheng
et al. [14]. Our NN model selection, however, was based on the validation complexes sampled without
replacement from the training data, which are independent of the test set. Therefore,
the gap in performance between our proposed SFs and the conventional models we report
in the following sections could in fact be even bigger if model/version selection
of conventional SFs were done based on their performance on independent validation
sets instead of the test set Cr.