
Deciding when to stop: efficient experimentation to learn to predict drug-target interactions

Active learning framework

An active learning method is an iterative process composed of four components: the
initialization, the model, the active learning strategy and an accuracy measure for
the predicted output in each step (Fig. 1). Most active learning papers focus on the second and third components. The active
learning framework starts with an initialization strategy which is followed by the
generation of a model. The model is used to make predictions; in our application, drug-target
interactions are predicted. Interactions can be measured by performing an experiment, i.e. a direct assay of drug-target interaction (e.g., in cell extracts). Based on
the predictions, an active learning strategy is applied to query new experiments (labels)
which will improve the model. We use batchwise learning, where a fixed number of experiments
is queried in each training round, thereby increasing the number of experiments with
known labels. Each training round defines a time-point in the active learning process and is measured by the number of batches of experiments
performed. For each time-point, the accuracy of the model is predicted. The process
is stopped, for example, when a certain budget for performing experiments is reached
or when the predicted accuracy of the model is high enough. We assume equal cost for each
experiment.
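The following is a minimal sketch of this loop, not the implementation used in this work; the function names (fit_model, select_batch, oracle) are placeholders for the KBMF model, the query strategy, and the laboratory experiment described below.

```python
import numpy as np

def active_learning(X, K_d, K_t, batch_size, n_rounds, fit_model, select_batch, oracle):
    """Generic batch-wise active learning loop (sketch).

    X            : N x M ternary experimental matrix (+1 hit, -1 miss, 0 unknown)
    K_d, K_t     : drug and target kernel matrices
    fit_model    : returns a prediction matrix F from (X, K_d, K_t), e.g. KBMF
    select_batch : returns indices of the next experiments to query
    oracle       : returns the measured label (+1/-1) for a (drug, target) pair
    """
    F = None
    for _ in range(n_rounds):
        F = fit_model(X, K_d, K_t)                # predict all interactions
        queries = select_batch(F, X, batch_size)  # e.g. uncertainty sampling
        for (d, t) in queries:
            X[d, t] = oracle(d, t)                # perform the experiment
        # a stopping rule (budget or predicted accuracy) could break the loop here
    return X, F
```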

Fig. 1. The major components of an active learning framework. The entries of the matrix are
color coded: label not known (light gray), interaction (black), no interaction (white).
At initialization, a subset of known labels for the interaction matrix and the drug
and target kernels K_d and K_t are provided. In each round of the active learning algorithm, the labels of the entire
interaction matrix are predicted and used to determine which labels to query next.
In this figure, the dark red values represent a high probability for a hit, whereas
the dark blue values represent a high probability for a miss.

Data representation

We use interaction matrices Y ∈ {−1, 1}^{N×M} to represent drug-target interactions, where N is the number of drugs and M is the number of targets. We assume that the outcome of the experiment
determines the ground truth label for an interaction matrix entry. Knowledge of the interaction between a drug d ∈ {1, 2, …, N} and a target t ∈ {1, 2, …, M} is ternary encoded in the experimental matrix X: +1 for an interaction, −1 for lack of interaction, and 0 to denote experiments which
have not yet been performed. Hereby, the set of remaining experiments (unlabeled data)
corresponds to the zero entries of X. Therefore, we consider a semi-supervised binary labeling problem where the sign
of the label indicates the interaction status between a drug and a target.
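As a small, made-up illustration of this ternary encoding:

```python
import numpy as np

# Hypothetical experimental matrix X for N = 3 drugs and M = 2 targets:
# +1 = measured interaction, -1 = measured non-interaction, 0 = not yet performed
X = np.array([[ 1,  0],
              [-1,  0],
              [ 0,  1]])

remaining = np.argwhere(X == 0)   # unlabeled data: experiments still to be performed
```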

Kernelized Bayesian matrix factorization (KBMF)

We use drug and target kernel matrices to represent, respectively, the pairwise similarity
of drugs to one another and the pairwise similarity of targets to one another. These
similarities are values between zero and one, where zero indicates no similarity and
one indicates the highest similarity. All the values on the diagonal of the kernel
are therefore one. In order to compute the similarities for the target kernel matrix
we use the normalized Smith-Waterman score [27], which uses the sequence information of two proteins to compute similarities. Other
possibilities to compute the similarity between proteins are to first compute features
using programs like ProtParam [28] or Prosite [29], as employed previously [19], and then compute the similarity between the features using a distance metric. For
computing the similarity between drugs we used SIMCOMP [30], a program which represents drugs as graphs and computes the similarity between
two drugs by searching for the maximal common subgraph isomorphism. Other tools to compute
the similarity between drugs are included in the OpenBabel package [31].

As described previously [24], [25], KBMF can be effectively applied to model drug-target interactions. It approximates
the interaction matrix by projecting the drug kernel K_d and the target kernel K_t into a common subspace of dimension R such that the interaction matrix Y can be reconstructed from the sign of its prediction matrix F:

Y = sign(F)   (1)

The prediction matrix F is a product of the projected kernel matrices:

F = (A_d^T K_d)^T (A_t^T K_t)   (2)

where A_d and A_t are subspace transformation matrices computed by the variational Bayes algorithm
[24], [25] using the values of the experimental matrix X. The dimension R of the subspace is a free parameter; we used the value of 20 previously determined
to be optimal for these datasets [25]. The entries of the kernel matrices K_d and K_t are a measure of the pairwise similarities between drugs and targets, respectively.
The similarity matrices provided by Yamanishi et al. [26] and the KBMF implementation of semi-supervised classification provided by Goenen
[25], [32] were used.

Note that it is not possible to factor the interaction matrix Y by multiplying the drug and target kernels directly, since they are matrices of differing
dimension. Therefore, transformation matrices A_d and A_t are needed which project the drug kernel and the target kernel into a common subspace.
Since the product of the transformed kernels F should reflect the observed experiments as well as possible, the values of A_t and A_d are found such that they maximize the posterior probability of having observed the
experimental matrix X, along with some prior information on the distribution of the elements in the transformation
matrices. Goenen [24], [25] used a graphical model to represent the relationships, and provided a detailed derivation
of an efficient inference scheme using variational approximation. The KBMF algorithm
is iterative and usually converges after 200 iterations. The values
of the kernels do not necessarily have to be in the range zero to one, since the scaling
of the kernels is implicitly encoded in the transformation matrices.
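The following is a numerical sketch of the reconstruction in Eqs. (1) and (2); the transformation matrices here are random placeholders rather than the output of the variational Bayes inference, and the kernels are identity matrices for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, R = 5, 4, 2              # drugs, targets, subspace dimension
K_d = np.eye(N)                # placeholder drug kernel (normally pairwise similarities)
K_t = np.eye(M)                # placeholder target kernel
A_d = rng.normal(size=(N, R))  # placeholder transformation matrices; in KBMF these
A_t = rng.normal(size=(M, R))  # are inferred by variational Bayes from X

G_d = A_d.T @ K_d              # projected drug kernel (R x N)
G_t = A_t.T @ K_t              # projected target kernel (R x M)
F = G_d.T @ G_t                # prediction matrix, Eq. (2), of size N x M
Y_hat = np.sign(F)             # predicted interaction labels, Eq. (1)
```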

Initialization and experiment selection

Our initialization strategy is to select a random column and one random experiment
from each row of the experimental matrix X.
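A sketch of this initialization strategy, assuming a fully known ground-truth matrix is available to act as the oracle (the function name initialize is ours):

```python
import numpy as np

def initialize(Y_true, rng=np.random.default_rng()):
    """Reveal a random column plus one random entry per row of the ground-truth
    matrix Y_true (+1/-1); all other entries stay 0 (not yet performed)."""
    N, M = Y_true.shape
    X = np.zeros((N, M), dtype=int)
    col = rng.integers(M)
    X[:, col] = Y_true[:, col]      # one random column
    for d in range(N):
        t = rng.integers(M)
        X[d, t] = Y_true[d, t]      # one random experiment per row
    return X
```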

Uncertainty sampling

We use uncertainty sampling [33] to form a batch of experiments by greedily choosing the experiments with the greatest value of the uncertainty function U [22]:

U(x) = 1 − max_{l ∈ L} P(l | x)   (3)

where L is the set of possible labels and l is a label.

For the KBMF case the posterior probability is computed by applying the sigmoid function to
the predicted interactions:

P(l = 1 | x_{d,t}) = 1 / (1 + exp(−F_{d,t}))   (4)

and P(l = −1 | x_{d,t}) = 1 − P(l = 1 | x_{d,t}) for no interaction, respectively.

Here we make use of the property of the KBMF method that the magnitude of the predicted
entry in F is an indicator of the confidence of the prediction.
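A sketch of this batch selection under Eqs. (3) and (4); the helper name select_batch and the exclusion of already-performed experiments by masking are our choices:

```python
import numpy as np

def select_batch(F, X, batch_size):
    """Greedy uncertainty sampling (Eqs. 3-4): pick unlabeled entries whose
    sigmoid posterior is closest to 0.5, i.e. whose |F| is smallest."""
    p_hit = 1.0 / (1.0 + np.exp(-F))                     # P(l = 1 | x), Eq. (4)
    uncertainty = 1.0 - np.maximum(p_hit, 1.0 - p_hit)   # Eq. (3)
    uncertainty[X != 0] = -np.inf                        # skip experiments already performed
    flat = np.argsort(uncertainty, axis=None)[::-1][:batch_size]
    return [tuple(idx) for idx in np.array(np.unravel_index(flat, F.shape)).T]
```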

Stopping rule

In order to stop the active learning process, a method is needed to predict the accuracy
of the model for a given time-point along with the confidence of that prediction.
As proposed previously in [18], the accuracy of a model at a given point in an active learning process can be predicted
using a regression function trained on other, similar experimental spaces. The fully
observed drug-target space is characterized by two measures, uniqueness (u) and responsiveness (r) [18], defined by:

u = (uRows(Y) + uColumns(Y)) / (N + M)   (5)

r = |{(d, t) : Y_{d,t} = 1}| / (N · M)   (6)

where uRows(·) and uColumns(·) compute the number of unique rows and unique columns of a matrix.

The uniqueness and responsiveness are values in the range [0, 1] and characterize the
interaction matrix. Responsiveness measures the fraction of interactions in the
matrix. Uniqueness is a measure of the independence of the rows and columns in the matrix.
The higher the value of uniqueness, the more difficult it is to make predictions.
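A sketch of these two measures for a fully observed interaction matrix, following the definitions above:

```python
import numpy as np

def uniqueness(Y):
    """Fraction of unique rows and columns in the interaction matrix (Eq. 5)."""
    n_rows, n_cols = Y.shape
    u_rows = np.unique(Y, axis=0).shape[0]   # number of unique rows
    u_cols = np.unique(Y, axis=1).shape[1]   # number of unique columns
    return (u_rows + u_cols) / (n_rows + n_cols)

def responsiveness(Y):
    """Fraction of entries that are interactions, i.e. equal to +1 (Eq. 6)."""
    return np.mean(Y == 1)
```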

These two measures serve two purposes: (1) They are used to compute features for a
time-point in our current active learning process. (2) They can be used to generate
simulation data having similar properties to the measured experimental data.

Each time-point t_i is described by a vector of p = 13 features f(1), …, f(p), defined as:

f(1), f(2): average observed responsiveness across columns (respectively rows)

f(3), f(4): average predicted responsiveness across columns (respectively rows)

f(5): average difference in predictions from last prediction for current time-point (t_i)

f(6): average difference in predictions from last prediction for previous time-point (t_{i-1})

f(7): fraction of predictions at t_{i-1} observed as responsive (l = 1) at t_i

f(8), f(9), f(10): minimum, maximum and mean number of experiments that have been performed for any drug

f(11), f(12), f(13): minimum, maximum and mean number of experiments that have been performed for any target

These features are normalized to the range [0, 1] and additional features are generated
by computing the square root of their pairwise products (a simple way to create quadratic
terms in the regression models). The extended feature vector is formed by concatenating the entries sqrt(f(i) · f(j)), for index pairs i, j ∈ {1, 2, …, p}, to the original feature vector. These extended feature vectors are the predictor variables z_i, while the true accuracies are stored as the entries y_i of the vector of observations y. Therefore, our predictor follows a linear model:

β̂ = argmin_β Σ_{i=1}^{N_f} (y_i − z_i^T β)^2   subject to   Σ_j |β_j| ≤ t   (7)

where N_f is the number of observations used and t ≥ 0 is a tuning parameter. We use lasso regression [34] to learn the vector of response coefficients β.
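A sketch of fitting such an accuracy predictor with scikit-learn; the feature expansion helper and the use of LassoCV are our illustration, and the data here are random placeholders rather than features from simulated trajectories:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def extend_features(f):
    """Append square roots of pairwise products to the base feature vector f."""
    p = len(f)
    cross = [np.sqrt(f[i] * f[j]) for i in range(p) for j in range(i + 1, p)]
    return np.concatenate([f, cross])

# Z: one extended feature vector per time-point, y: adjusted accuracies (observations).
# In practice these would come from simulated active learning trajectories.
Z = np.vstack([extend_features(np.random.rand(13)) for _ in range(200)])
y = np.random.rand(200)

model = LassoCV(cv=11).fit(Z, y)      # lasso with cross-validated regularization strength
predicted_accuracy = model.predict(Z[:1])
```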

To learn the accuracy predictor via simulation data, interaction matrices of size
50×50 were randomly sampled in the grid of uniqueness and responsiveness parameters
5%, 10%, …, 95%. For each interaction matrix we derived 'perfect' Gaussian similarity kernels K_d, K_t
from the pairwise distances of the column-space and row-space, respectively. These were
disrupted by forcing 0%, 5%, or 10% of the kernel entries to the value 1 and regularized to ensure positive semidefiniteness.
Features computed from trajectories of the uncertainty sampling active learner on
these data were collected; for each trajectory we also measured the accuracy of prediction
against the ground truth. A linear model of these features against adjusted accuracies
(accuracy above the fraction of experiments performed so far) was fitted by lasso
regression [34]. The lasso regularization parameter was chosen by 11-fold cross validation under
squared loss, with holdout granularity at the level of trajectories. To make accuracy
predictions from adjusted accuracy predictions, we added the fraction of experiments
performed so far.
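A sketch of how such 'perfect' kernels might be derived and disrupted; the Gaussian bandwidth, the disruption step, and the eigenvalue clipping used to restore positive semidefiniteness are our assumptions about one reasonable realization:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def perfect_kernel(profiles, gamma=1.0):
    """Gaussian similarity kernel from pairwise distances between profile vectors."""
    D = squareform(pdist(profiles))
    return np.exp(-gamma * D ** 2)

def disrupt_and_regularize(K, frac, rng=np.random.default_rng()):
    """Force a fraction of entries to 1, then clip negative eigenvalues to keep K PSD."""
    K = K.copy()
    idx = rng.choice(K.size, size=int(frac * K.size), replace=False)
    K.flat[idx] = 1.0
    K = (K + K.T) / 2
    w, V = np.linalg.eigh(K)
    return V @ np.diag(np.clip(w, 0, None)) @ V.T

Y = np.sign(np.random.rand(50, 50) - 0.5)                      # placeholder simulated interactions
K_d = disrupt_and_regularize(perfect_kernel(Y), frac=0.05)     # kernel over drug profiles
K_t = disrupt_and_regularize(perfect_kernel(Y.T), frac=0.05)   # kernel over target profiles
```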