Machine-learning scoring functions for identifying native poses of ligands docked to known and novel proteins

Evaluation of scoring functions

In contrast to our earlier work in improving and examining scoring and ranking accuracies
of different families of SFs [3,4], this study is devoted to enhancing and comparing SFs in terms of their docking power.
Docking power measures the ability of an SF to distinguish a promising binding mode
from a less promising one. Typically, generated conformations are ranked in non-ascending
order according to their predicted binding affinity (BA). Ligand poses that are very
close to the experimentally-determined ones should be ranked high. Closeness is measured
in terms of RMSD (in Å) from the true binding pose. Generally, in docking, a pose
whose RMSD is within 2 Å of the true pose is considered a success or a hit.

In this work, we use comparison criteria similar to those used by Cheng et al. [2] to
compare the docking accuracies of sixteen popular conventional SFs. Doing so ensures
a fair comparison of ML SFs to those examined in that study, in which each SF was assessed
in terms of its ability to find the pose that is closest to the native one. More specifically,
docking ability is expressed in terms of a success rate statistic S that accounts for the percentage of times an SF is able to find a pose whose RMSD
is within a predefined cutoff value C Å by considering only the N topmost poses ranked by their predicted scores. Since success rates for various C (e.g., 0, 1, 2, and 3 Å) and N (e.g., 1, 2, 3, and 5) values are reported in this study, we use the notation S_N^C to distinguish between these different statistics. For example, S_2^1 is the percentage of protein-ligand complexes for which either one of the two best scoring
poses is within 1 Å of the true pose. It should be noted that
S_1^0 is the most stringent docking measure, in which an SF is considered successful only
if the best scoring pose is the native pose. By the same token, and based on the C and N values listed earlier, the least strict docking performance statistic is S_5^3, in which an SF is considered successful if at least one of the five best scoring
poses is within 3 Å of the true pose.
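
To make the S_N^C statistic concrete, the short sketch below computes it for a set of ranked poses. The data layout (a list of (predicted score, true RMSD) pairs per complex) is our own illustration and is not taken from the paper's implementation.

```python
# Minimal sketch of the S_N^C docking-power statistic (illustrative data layout).
from typing import List, Tuple

def success_rate(complexes: List[List[Tuple[float, float]]],
                 n_top: int, cutoff: float) -> float:
    """Percentage of complexes for which at least one of the n_top best-scoring
    poses lies within `cutoff` Å RMSD of the native pose."""
    hits = 0
    for poses in complexes:
        # Rank poses in non-ascending order of predicted binding affinity;
        # an RMSD-predicting SF would instead sort ascending on predicted RMSD.
        ranked = sorted(poses, key=lambda p: p[0], reverse=True)
        if any(rmsd <= cutoff for _, rmsd in ranked[:n_top]):
            hits += 1
    return 100.0 * hits / len(complexes)

# Example: S_1^2 (best-scoring pose within 2 Å) over two toy complexes -> 50.0
toy = [[(7.1, 0.8), (6.5, 3.4)], [(5.2, 4.1), (4.9, 1.6)]]
print(success_rate(toy, n_top=1, cutoff=2.0))
```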

ML vs. conventional approaches on a diverse test set

After building six ML SFs, we compare their docking performance to the sixteen conventional
SFs on the core test set Cr, which comprises thousands of protein-ligand complex conformations corresponding to
195 different native poses in 65 diverse protein families. As mentioned earlier, we
conducted two experiments. In the first, BA values predicted using the conventional
and ML SFs were used to rank poses in a non-ascending order for each complex in Cr. In the other experiment, RMSD-based ML models directly predicted RMSD values that
are used to rank in non-descending order the poses for the given complex.

By examining the true RMSD values of the best N scoring ligands using the two prediction approaches, success rates of SFs are computed;
these are shown in Figure 2. Panels (a) and (b) in the figure show the success rates S_1^1, S_1^2, and S_1^3 for all 22 SFs. The SFs, as in the other panels, are sorted in non-ascending order
from the most stringent docking test statistic value to the least stringent one. In
the top two panels, for example, success rates are ranked based on S_1^1, then on S_1^2 in case of a tie in S_1^1, and finally on S_1^3 if two or more SFs tie in S_1^2. In both BA- and RMSD-based scoring, we find that the 22 SFs vary significantly in
their docking performance. The top three BA-based SFs, GOLD::ASP, DS::PLP1, and DrugScorePDB::PairSurf,
have success rates of more than 60% in terms of the S_1^1 measure. That is in comparison to the BA-based ML SFs, the best of which has an S_1^1 value barely exceeding 50% (Figure 2(a)). On the other hand, the other six ML SFs that directly predict RMSD values achieve
S_1^1 success rates of over 70%, as shown in Figure 2(b). The top performing of these ML SFs, MARS::XARG, has a success rate of ~80%. This
is a significant improvement (~14%) over the best conventional SF, the empirical
GOLD::ASP, whose S_1^1 value is ~70%. Similar conclusions can also be made for the less stringent docking
performance measures S_1^2 and S_1^3, in which the RMSD cut-off constraint is relaxed to 2 Å and 3 Å, respectively.

Figure 2. Success rates of conventional and ML SFs in identifying binding poses that are closest
to native ones. The results show these rates by examining the top N scoring ligands that lie within an RMSD cut-off of C Å from their respective native poses. Panels on the left show success rates when binding-affinity-based
(BA) scoring is used, and the ones on the right show the same results when ML SFs predicted
RMSD values directly. Scoring of conventional SFs is BA-based in all cases and, for comparison
convenience, we show their performance in the right panels as well.

The success rates plotted in the top two panels (Figure 2(a) and 2(b)) are reported when native poses are included in the decoy sets. Panels (c) and (d)
of the same figure show the impact of removing the native poses on docking success
rates of all SFs. Examining the difference in their success-rate statistics, which ranges from 0 to ~5%, shows that the performance of almost all SFs does not radically decrease. This, as noted by Cheng et al. [2], is because some of the poses in the decoy sets are actually very close
to the native ones. As a result, the impact of allowing native poses in the decoy
sets is insignificant in most cases and therefore we include such poses in all other
tests in the paper.

In reality, more than one pose is usually used from the outcomes of a docking run
in the next stages of drug design for further experimentation. It is useful therefore
to assess the docking accuracy of SFs when more than one pose is considered (i.e., N > 1). Figure 2(e) and 2(f) show the success rates of SFs when the RMSD values of the best 1, 2, and 3 scoring
poses are examined. These rates correspond, respectively, to S_1^2, S_2^2, and S_3^2. The plots show a significant boost in performance for almost all SFs. By comparing
S_1^2 to S_3^2, we observe a jump in accuracy from 82% to 92% for GOLD::ASP and from 87% to 96%
for RF::RG, which models RMSD values directly. Such results signify the importance of
examining an ensemble of top scoring poses because there is a very good chance it
contains relevant conformations and hence good drug candidates.

Upon developing RMSD-based ML scoring models, we noticed excellent improvement over
their binding-affinity-based counterparts, as shown in Figure 2. We conducted an experiment to investigate whether ML SFs would maintain a similar level
of accuracy when examined for their ability to pinpoint the native poses
from their respective 100-pose decoy sets. The bottom two panels, (g) and (h), plot
the success rates at the 0 Å cut-off (i.e., exact reproduction of the native pose) for the six ML SFs. By examining the five best scoring poses, we notice that the
top BA-based SF, MLR::X, was able to distinguish native binding poses in ~60% of the
195 decoy sets, whereas the top RMSD-based SF, MARS::XARG, achieved a success rate
of S_5^0 = 77% on the same protein-ligand complexes. It should be noted that both sets of
ML SFs, the BA- and RMSD-based, were trained and tested on completely disjoint training and test
sets. Therefore, this gap in performance is largely due to the explicit modeling of
RMSD values and the corresponding abundance of training data which includes information
from both native and computationally-generated poses.

ML vs. conventional approaches on homogeneous test sets

In the previous section, performance of SFs was assessed on the diverse test set Cr. The core set consists of more than sixty different protein families each of which
is related to a subset of protein families in Pr. That is, while the training and test set complexes were different (at least for
all the ML SFs), proteins present in the core test set were also present in the training
set, albeit bound to different ligands. A much more stringent test of SFs is their
evaluation on a completely new protein, i.e., when test set complexes all feature
a given protein – the test set is homogeneous – and training set complexes do not feature
that protein. To address this issue, four homogeneous test sets were constructed corresponding
to the four most frequently occurring proteins in our data: HIV protease (112 complexes),
trypsin (73), carbonic anhydrase (44), and thrombin (38). Each of these protein-specific
test sets was formed by extracting complexes containing the protein from Cr (one cluster or three complexes) and Pr (remaining complexes). For each test set, we retrained BRT, RF, SVM, kNN, MARS, and MLR models on the non-test set complexes of Pr. Figure 3 shows the docking performance of resultant BA and RMSD-based ML scoring models on
the four protein families. The plots clearly show that success rates of SFs are dependent
on the protein family under investigation. It is easier for some SFs to distinguish
good poses for HIV protease and thrombin than for carbonic anhydrase. The best performing
SFs on HIV protease and thrombin complexes, MLR::XRG and MLR::XG, respectively, achieve
success rates of over 95%, as shown in panels (b) and (n), whereas no SF exceeded 65% in success rate in the case
of carbonic anhydrase as demonstrated in panels (i) and (j). Finding the native poses
is even more challenging for all SFs, although we can notice that RMSD-based SFs outperform
those models that rank poses using predicted BA. The exception to this is the SF MLR::XAR,
whose performance exceeds that of all RMSD-based ML models in terms of the success rate in
reproducing native poses, as illustrated in panels (c) and (d).
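
The protein-specific hold-out described at the start of this section can be sketched as follows; the record layout (a 'protein' label per complex) and the toy entries are assumptions for illustration only.

```python
# Hedged sketch of a protein-specific (homogeneous) hold-out split.
def protein_holdout(all_complexes, target):
    """Complexes of `target` form the test set; everything else is for retraining."""
    test = [c for c in all_complexes if c["protein"] == target]
    train = [c for c in all_complexes if c["protein"] != target]
    return train, test

# Toy records standing in for the Pr/Cr complexes (real ones carry X/A/R/G features).
complexes = [{"protein": "HIV protease", "id": "1hvr"},
             {"protein": "trypsin", "id": "1tng"},
             {"protein": "thrombin", "id": "1dwd"}]
train, test = protein_holdout(complexes, "HIV protease")
print(len(train), len(test))   # -> 2 1
```

For each of the four targets, the six ML models are then refit on the training portion and their success rates measured on the held-out complexes of that protein.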

Figure 3. Success rates of ML SFs in identifying binding poses that are closest to native ones
observed in four protein families: HIV protease (a-d), trypsin (e-h), carbonic anhydrase
(i-l), and thrombin (m-p). The results show these rates by examining the top N scoring ligands that lie within an RMSD cut-off of C Å from their respective native poses. Panels on the left show success rates when binding-affinity-based
(BA) scoring is used, and the ones on the right show the same results when ML SFs predicted
RMSD values directly.

The results also indicate that multivariate linear regression models (MLR), which
are basically empirical SFs, are the most accurate across the four families, whereas
ensemble learning models, RF and BRT, despite their good performance in Figure 2, appear to be inferior to the simpler models in Figure 3. This can be attributed to the high rigidity of linear models compared to ensemble
approaches. In other words, linear models are not as sensitive as ensemble techniques
to the presence or absence of a certain protein family in the data on which they are
trained. On the other hand, RF- and BRT-based SFs are more flexible and adapt more closely to
their training data, which in some cases causes them to fail to generalize well to completely
different test proteins, as seen in Figure 3. In practice, however, it has been observed that more than 92% of today's drug targets
are similar to known proteins in the PDB [33], an archive of high-quality complexes from which our training and test compounds
originated. Therefore, if the goal of a docking run is to identify the most stable
poses, it is important to consider sophisticated SFs (such as RF and BRT) calibrated
with training sets containing some known binders to the target of interest. Simpler
models, such as MLR and MARS, tend to be more accurate when docking to novel proteins
that are not present in training data.

Sophisticated ML algorithms are not the only critical element in building a capable
SF. Features to which they are fitted also play an important role as can be seen in
Figure 3. By comparing the right panels to the ones on the left, we can notice that X-Score
features (X) are almost always present in BA-based SFs while those provided by GOLD
(G) are used more to model RMSD explicitly. This implies that X-Score features are
more accurate than other feature sets in predicting BA, while GOLD features are the
best for estimating RMSD and hence poses close to the native one.

Performance of ML SFs on novel targets

The training-test set pair (Pr, Cr) is a useful benchmark when the aim is to evaluate the performance of SFs on targets
that have some degree of sequence similarity with at least one protein present in
the complexes of the training set. This is typically the case since, as it was mentioned
earlier, 92% of drug targets are similar to known proteins [33]. When the goal is to assess SFs in the context of novel protein targets, however,
the training-test set pair (Pr, Cr) is not as suitable because of the partial overlap in protein families between
Pr and Cr. We considered this issue to some extent in the previous section, where we investigated
the docking accuracy of SFs on four different protein-specific test sets after training
them on complexes that did not have the protein under consideration. This resulted
in a drop in performance of all SFs, especially in the case of carbonic anhydrase
as a target. However, even if there are no common proteins between training and test
set complexes, different proteins at their binding sites may have sequence and structural
similarities, which influence docking results. To more rigorously and systematically
assess the performance of BA and RMSD-based ML SFs on novel targets, we performed
a separate set of experiments in which we limited BLAST sequence similarity between
the binding sites of proteins present in the training and test set complexes. Sequence
similarity was used to construct the core test set and it was also noted by Ballester
and Mitchell as being relevant to testing the efficacy of SFs on a novel target [34].

Specifically, for each similarity cut-off value S = 30%, 40%, 50%,…, 100%, we constructed 100 different independent 100-complex test
and T-complex training set pairs. Two versions were created out of these training and test
set pairs. The first version uses BA as a response variable that SFs will be fitted
to, predict, and employ to assess poses. The response variable of the other version
is the RMSD value of true poses (RMSD = 0 Å) and computer-generated decoys (with RMSD
> 0 Å) of each original protein-ligand complex in every training and test dataset
pair. A total of 20 poses per complex have been used in this second version. Then,
we trained BA and RMSD scoring models (MLR, MARS, kNN, SVM, RF, and BRT) using XARG features on the training set and evaluated them on
the corresponding test set, and determined their average performance over the 100
training-test-set pairs to obtain robust results. Since SF docking performance depends
upon both similarity cut-off and training set size and since training set size is
constrained by similarity cut-off (a larger S means a larger feasible T), we investigated different ranges of S (30% to 100%, 50% to 100%, and 70% to 100%), and for each range we set T close to the largest feasible value for the smallest S value in that range. Each test and training set pair was constructed as follows. We
randomly sampled a test set of 100 protein-ligand complexes without replacement from
all complexes at our disposal: 1105 in Pr + 195 in Cr = 1300 complexes. The remaining 1200 complexes were randomly scanned until T different complexes were found that had protein binding site similarity of S % or less with the protein binding sites of all complexes in the test set – if less
than T such complexes were found, then the process was repeated with a new 100-complex test
set.
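
A hedged sketch of this construction procedure is given below; `similarity` stands in for the BLAST-based binding-site comparison and is an assumed placeholder, not a function provided by the study.

```python
# Sketch of building one similarity-constrained training/test set pair.
import random

def make_pair(complexes, similarity, n_test=100, n_train=400, max_sim=0.5,
              max_restarts=1000):
    """Draw a test set of n_test complexes, then gather n_train complexes whose
    binding-site similarity to every test complex is at most max_sim; restart
    with a fresh test set if too few such complexes can be found."""
    for _ in range(max_restarts):
        test = random.sample(complexes, n_test)
        rest = [c for c in complexes if c not in test]
        random.shuffle(rest)
        train = []
        for c in rest:
            if all(similarity(c, t) <= max_sim for t in test):
                train.append(c)
            if len(train) == n_train:
                return train, test
    raise RuntimeError("no feasible pair for this cutoff and training set size")
```

One hundred such pairs are generated per cutoff S, and the reported success rates are averaged over them.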

The performance of the six scoring models in terms of their docking accuracy is depicted in Figure 4 for various similarity cut-offs and training set sizes. The plots in each column
of Figure 4 ((a) and (d), (b) and (e), and (c) and (f)) show docking power results for similarity cut-offs of 30% to 100%, 50% to 100%,
and 70% to 100%, for which T = 100, 400, and 700 complexes is the largest training set size feasible for S = 30%, 50%, and 70%, respectively. The results in these plots are consistent with
those obtained for the four protein families presented in the previous section and
illustrated in Figure 3. More specifically, we notice that simpler models such as MLR::XARG and MARS::XARG
perform the best across almost all values of similarity cut-offs (S = 30%, 50%, or 70% – 100%), training set sizes (T = 100, 400, or 700 complexes), and response variables (Y = BA or RMSD). This is mainly due to their rigidity. The performance of such models
does not suffer as much as that of the more flexible ML SFs when their training and test proteins
have low sequence similarity. On the other hand, the SFs based on MLR and MARS are
also less responsive to increasing the similarity between protein families in the
training and test sets. Unlike the other four nonlinear ML SFs, we can observe that
the performance curves of MLR and MARS are flat and do not seem to react to having
more and more similar training and test proteins. This observation is clearer in
the bottom row of plots of Figure 4 where the training set sizes are large enough (i.e., 2000 ligand poses or more).
Plot (f) shows that the RMSD-based SFs RF and BRT are catching up with MLR and MARS
and can eventually surpass them in terms of performance as training set sizes become
larger. Similar to RF and BRT, the other nonlinear RMSD SFs, namely kNN and SVM, have the sharpest increase in docking performance as similarity cut-off
S increases. However, unlike the ensemble SFs RF and BRT, kNN and SVM SFs are the least reliable models when ligand poses need to be scored for
novel targets.

Figure 4. Performance of SFs in terms of docking success rate as a function of BLAST sequence similarity cutoff between binding sites of proteins
in training and test complexes. In panels (a)-(c), a single (native) pose is used per training complex, whereas
in panels (d)-(f) 20 randomly-selected poses are used per training complex.

To summarize, imposing a sequence similarity cut-off between the binding sites of
proteins in training and test set complexes has an expected adverse impact on the
accuracy of all scoring models. However, increasing the number of training complexes
helps improve accuracy for all similarity cut-offs as we will show in more detail
in the next section. Scoring functions based on MLR and MARS have the best accuracy
when training set sizes are small, which is typically the case when the response variable
is binding affinity. The other generally-competitive ML models are RF and BRT whose
accuracies surpass all other SFs when evaluated on targets that have some sequence
similarity with their training proteins.

Impact of training set size

An important factor influencing the accuracy of ML SFs is the size of the training
dataset. In the case of BA-based ML SFs, training dataset size can be increased by
training on a larger set of protein-ligand complexes with known binding affinity values.
In the case of RMSD-based SFs, on the other hand, training dataset size can be increased
not only by considering a large number of protein-ligand complexes in the training
set, but also by using a larger number of computationally-generated ligand poses per
complex since each pose provides a new training record because it corresponds to a
different combination of features and/or RMSD value. Unlike experimental binding affinity
values, which have inherent noise and require additional resources to obtain, RMSD
from the native conformation for a new ligand pose is computationally determined and
is accurate.
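
As a rough illustration of why RMSD labels are cheap to generate, the sketch below derives one (features, RMSD) record per pose directly from atomic coordinates. The `featurize` callable is a placeholder for X/A/R/G descriptor extraction, and the symmetry-corrected RMSD used by docking tools is omitted for brevity.

```python
# Illustrative generation of RMSD-labeled training records from ligand poses.
import numpy as np

def rmsd(pose_xyz: np.ndarray, native_xyz: np.ndarray) -> float:
    """Plain RMSD (Å) between matched atoms of a pose and the native conformation."""
    diff = pose_xyz - native_xyz                    # shape (n_atoms, 3)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def training_records(complexes, featurize):
    """Yield one (feature_vector, rmsd_label) pair per pose of every complex."""
    for native_xyz, poses in complexes:             # poses may include the native one
        for pose_xyz in poses:
            yield featurize(pose_xyz), rmsd(pose_xyz, native_xyz)
```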

We carried out three different experiments to determine: (i) the response of BA-based
ML SFs to an increasing number of training protein-ligand complexes, (ii) the response
of RMSD-based ML SFs to an increasing number of training protein-ligand complexes while
the number of poses for each complex is fixed at 50, and (iii) the response of RMSD-based
ML SFs to an increasing number of computationally-generated poses while the number of
protein-ligand complexes is fixed at 1105. In the first two experiments, we built
6 ML SFs, each of which was trained on a randomly sampled x% of the 1105 protein-ligand complexes in Pr, where x = 10, 20,…, 100. The dependent variable in the first experiment is binding affinity
(Y = BA), and the performance of these BA-based ML SFs is shown in Figure 5(a) and partly in Figure 5(d) (MLR::XARG). The set of RMSD values from the native pose is used as the dependent variable
for ML SFs trained in the second experiment (Y = RMSD). For a given value of x, the number of conformations is fixed at 50 ligand poses for each protein-ligand
complex. The docking accuracy of these RMSD-based ML models is shown in Figure 5(b). In the third experiment, all 1105 complexes in Pr were used for training the RMSD-based ML SFs (i.e., Y = RMSD) with x randomly sampled poses considered per complex, where x = 2, 6, 10,…, 50; results for this are reported in Figure 5(c) and partly in Figure 5(d) (MARS::XARG). In all three experiments, results reported are the average of 50
random runs in order to ensure all complexes and a variety of poses are equally represented.
All training and test complexes in these experiments are characterized by the XARG
(= X ∪ A ∪ R ∪ G) features.
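
The sub-sampling protocol of these experiments can be sketched as below. The array layout (per-pose feature rows grouped by a complex identifier) and the choice of scikit-learn's RandomForestRegressor as a stand-in learner are our assumptions, not the study's implementation.

```python
# Sketch of one sub-sampling experiment: train on a random fraction of the complexes,
# score the Cr poses, and report the average top-1, 2-Å success rate over runs.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def avg_success(X_tr, y_tr, grp_tr, X_te, rmsd_te, grp_te, frac, n_runs=50, seed=0):
    rng = np.random.default_rng(seed)
    train_complexes = np.unique(grp_tr)
    rates = []
    for _ in range(n_runs):
        chosen = rng.choice(train_complexes,
                            size=int(frac * len(train_complexes)), replace=False)
        mask = np.isin(grp_tr, chosen)
        model = RandomForestRegressor(n_estimators=500).fit(X_tr[mask], y_tr[mask])
        pred = model.predict(X_te)                  # predicted RMSD per pose
        hits = 0
        for c in np.unique(grp_te):
            idx = np.where(grp_te == c)[0]
            best = idx[np.argmin(pred[idx])]        # best pose = lowest predicted RMSD
            hits += rmsd_te[best] <= 2.0
        rates.append(100.0 * hits / len(np.unique(grp_te)))
    return float(np.mean(rates))
```

For the BA-based variant, the response vector would hold binding affinities and the best pose per complex would be the one with the highest predicted score.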

Figure 5. Dependence of docking accuracy of ML scoring models on training set size when training
complexes are selected randomly (without replacement) from Pr and the models are tested on Cr. The size of the training data was increased by including more protein-ligand complexes ((a) and (b)) or more computationally-generated poses for all complexes ((c) and (d)).

From Figure 5(a), it is evident that increasing training dataset size has a positive impact on docking
accuracy (measured in terms of success rate), although it is most appreciable in the case of MLR::XARG and MARS::XARG,
two of the simpler models, MLR being linear and MARS being piecewise linear. The performance
of the other models, which are all highly nonlinear, seems to saturate at 60% of the
maximum training dataset size used. The performance of all six models is quite modest,
with MLR::XARG being the only one with docking success rate (slightly) in excess of
50%. The explanation for these results is that binding affinity is not a very good
response variable to learn for the docking problem because the models are trained
only on native poses (for which binding affinity data is available) although they
need to be able to distinguish between native and non-native poses during testing.
This means that the training data is not particularly well suited for the task for
which these models are used. An additional reason is that experimental binding affinity
data, though useful, is inherently noisy. The flexible highly nonlinear models, RF,
BRT, SVM, and kNN, are susceptible to this noise because the training dataset (arising only from
native poses) is not particularly relevant to the test scenario (consisting of both
native and non-native poses). Therefore, the more rigid MLR and MARS models fare better
in this case.

When RMSD is used as the response variable, the training set consists of data from
both native and non-native poses and hence is more relevant to the test scenario; the
RMSD values, being computationally determined, are also accurate. Consequently,
docking accuracy of all SFs improves dramatically compared to their BA-based counterparts
as can be observed by comparing Figure 5(a) to Figure 5(b) and 5(c). We also notice that all SFs respond favorably to increasing training set size by
either considering more training complexes (Figure 5(b)) or more computationally-generated training poses (Figure 5(c)). Even for the smallest training set sizes in Figure 5(b) and 5(c), we notice that the docking accuracy of most RMSD-based SFs is about 70% or more,
which is far better than the roughly 50% success rate for the largest training set
size for the best BA-based SF MLR::XARG.

In Figure 5(d), we compare the top performing RMSD SF, MARS::XARG, to the best BA-based SFs, GOLD::ASP
and MLR::XARG, to show how docking performance can be improved by just increasing
the number of computationally-generated poses, an important feature that RMSD-based
SFs possess but which is lacking in their BA-based conventional counterparts. To increase
the performance of these BA-based SFs to a comparable level, thousands of protein-ligand
complexes with high-quality experimentally-determined binding affinity data need to
be collected. Such a requirement is too expensive to meet in practice. Furthermore,
RMSD-based SFs with the same training complexes will still likely outperform BA-based
SFs.

Impact of the type and number of features

The binding pose of a protein-ligand complex depends on many physicochemical interaction
factors that are too complex to be accurately captured by any one approach. Therefore,
we perform two different experiments to investigate how utilizing different types
of features from different scoring tools, X-Score, AffiScore, RF-Score, and GOLD,
and considering an increasing number of features affects the performance of the various
ML models. In the first experiment, the ML models were trained on Pr characterized by all 15 combinations of X, A, R, and G feature types and tested on
the corresponding core test set Cr characterized by the same features. Table 2 reports the docking success rate for three groups of ML SFs. The first set (Table 2 top part) of 90 (6 methods × 15 feature combinations) BA-based SFs is trained on
1105 Pr complexes. The second set (Table 2 middle part) of 90 RMSD-based SFs is again trained on the 1105 Pr complexes with one randomly sampled pose from 50 poses generated per complex. Therefore,
the training set size for these first two groups of SFs is identical and consists
of 1105 training records, with the only difference being the response variable that
they are trained for, BA in the first case and RMSD in the second case. The final
(Table 2 bottom part) 90 RMSD-based SFs are trained on 1105 Pr complexes, with 50 poses per complex, so that their training set size is 1105 × 50 =
55,250 records.
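
The 15 feature-type combinations behind Table 2 are simply the non-empty subsets of {X, A, R, G}. The sketch below enumerates them and assembles the corresponding feature matrices, with random placeholders standing in for the real X-Score (6), AffiScore (30), RF-Score (36), and GOLD (14) descriptors.

```python
# Enumerate the 15 X/A/R/G feature-block combinations and stack their columns.
from itertools import combinations
import numpy as np

n_records = 1105                                  # one record per Pr complex
blocks = {"X": np.random.rand(n_records, 6),
          "A": np.random.rand(n_records, 30),
          "R": np.random.rand(n_records, 36),
          "G": np.random.rand(n_records, 14)}

combos = [c for r in range(1, 5) for c in combinations("XARG", r)]
assert len(combos) == 15                          # X, A, ..., XARG

for combo in combos:
    X = np.hstack([blocks[b] for b in combo])     # e.g. "XG" -> 6 + 14 = 20 columns
    # ...train each of the six ML models on (X, response) and test on Cr here...
    print("".join(combo), X.shape)
```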

Table 2. Docking success rate (in %) of ML SFs trained on Pr and tested on Cr complexes when characterized by different features.

We notice that the success rate of almost all models improves by considering more than one type of feature
rather than just X, A, R, or G features alone. The table also shows that RMSD SFs
are substantially more accurate than their BA counterparts for each feature type and
ML method. By comparing the 180 RMSD SFs with the corresponding 90 BA SFs across all
feature types and ML models, we find that the former are, on average, almost twice
as accurate as the BA approaches (50.64% and 57.61% vs. 27.95%; see Table 2, rightmost column). In terms of feature types, we note that the most accurate SFs
always include X-Score and GOLD features. SFs that are fitted to the individual X
and G features only are more accurate than their A and R counterparts whether they
are BA or RMSD models. By averaging the performance of all ML models across all feature
types, we see that the simple linear approach MLR outperforms other more sophisticated
ML SFs that are trained to predict binding affinity. MARS outperforms all other RMSD
SFs that are trained on the same number of training records (1105) as their BA counterparts.
The lower part of the table shows that the ensemble SF RF that predicts the binding
pose directly has the highest average docking accuracy (62.47%) across the 15
different feature types, and that MARS::XARG has the highest docking accuracy (78.97%) overall.
Comparing the two versions of RMSD SFs in the middle and lower portions of the table,
we notice that the largest gainers from increasing training set size are the most
nonlinear ML techniques (RF, BRT, SVM and kNN). The results of Table 2 are useful in assessing the relative benefit of different types of features for the
various ML models.

A pertinent issue when considering a variety of features is how well different SF
models exploit an increasing number of features. The features we consider are the
X, A, G, and a larger set of geometrical features than the R feature set available
from the RF-Score tool. Recall from the Compound Characterization subsection that
RF-Score counts the number of occurrences of 36 different protein-ligand atom pairs
within a distance of 12 Å. In order to have more features of this kind for this experiment,
we produce 36 such counts for each of five contiguous distance intervals of 4 Å: (0 Å,
4 Å], (4 Å, 8 Å],…, (16 Å, 20 Å]. This provides us with 6 X, 30 A, 14 G, and (36 × 5
=) 180 geometrical features, or a total of 230 features. We randomly select (without
replacement) x features from this pool, where x = 20, 60, 100,…, 220, and use them to characterize the Pr dataset, which we then use to train the six ML models. These models are subsequently
tested on the Cr dataset characterized by the same features. This process is repeated 100 times to
obtain robust average statistics, which are plotted in Figure 6.
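
A simplified sketch of the binned occurrence counts is given below. The atom typing (4 protein × 9 ligand element types, as in RF-Score) and the function interface are assumptions for illustration; the study's actual geometrical features come from the RF-Score tool.

```python
# Count protein-ligand element-pair contacts in five 4 Å distance shells.
from itertools import product
import numpy as np

PROT_ELEMS = ["C", "N", "O", "S"]                            # 4 protein atom types
LIG_ELEMS = ["C", "N", "O", "S", "P", "F", "Cl", "Br", "I"]  # 9 ligand atom types
BINS = [(0, 4), (4, 8), (8, 12), (12, 16), (16, 20)]         # five 4 Å intervals

def binned_counts(prot_xyz, prot_elem, lig_xyz, lig_elem):
    """Return {(protein_elem, ligand_elem, bin_index): count}; flattened this
    gives the 36 pairs x 5 bins = 180 geometrical features described above."""
    counts = {(p, l, b): 0
              for p, l in product(PROT_ELEMS, LIG_ELEMS) for b in range(len(BINS))}
    dists = np.linalg.norm(prot_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    for i, pe in enumerate(prot_elem):
        for j, le in enumerate(lig_elem):
            for b, (lo, hi) in enumerate(BINS):
                if lo < dists[i, j] <= hi:
                    counts[(pe, le, b)] += 1
    return counts
```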

Figure 6. Dependence of docking accuracy of ML scoring models on the number of features, with
the features drawn randomly (without replacement) from a pool of X, A, R, and G-type
features and used to train the ML models on the Pr dataset and then tested on the disjoint core set Cr. In panels (a) and (b), a single pose (the native pose in (a) and a randomly-selected pose
in (b)) is used per training complex, whereas in panel (c) 50 randomly-selected poses
are used per training complex.

The performance of the BA SFs is depicted in Figure 6(a) whereas panels (b) and (c) of the same figure show the docking success rates for
the RMSD versions of the scoring models. In order to fairly compare the docking performance
of BA and RMSD SFs as the number of features increases, we fixed their training set sizes
to 1105 complexes as shown in Figure 6(a) and 6(b). We also show in Figure 6(c) the effect of increasing number of features on the docking performance of RMSD SFs
when trained on all Pr complexes, with 50 poses per complex. The plots clearly indicate that RMSD SFs benefit
the most from characterizing complexes with more descriptors. This is the case regardless
of the number of records used to train RMSD SFs (compare plots (b) and (c) in Figure
6). The only exception is the RMSD SF based on SVM, which appears to overfit the
1105 training records when they are characterized by more than 60 features. This ML
scoring function, however, performs better when trained on a larger number of records
and shows a slight increase in performance as more features are included in building
the model. Other RMSD SFs such as RF, BRT, MLR, and MARS have much sharper slopes
than SVM and kNN. Compare these SFs to their BA counterparts in Figure 6(a), where most of the latter show little to no improvement as the number of features increases,
due to overfitting. Not only are most RMSD SFs resilient to overfitting, they also improve
dramatically as more relevant features are extracted. Adding more features may yield
the highest gains in performance when more training complexes are also included, as was discussed
in the previous subsection.