RNA-binding residues prediction using structural features

Dataset

The dataset used in our experiments is RB344 of PRIDB dataset 40]. RB344 is a non-redundant dataset which contains 344 proteins belonging to 13 categories:
RNAse, SRP, aptamer, dsRNA, exosome, mRNA, ribosomal, small, snRNP, splicing, tRNA,
viral, and other. Global sequence alignment was performed on the dataset using the
needle function provided by the emboss suite 41]. The sequence identities in RB344 are smaller than 30 %. RNA-binding residues were
determined using two definitions: (i) a residue whose any atom is within a 5 Å distance
of any atom in a nucleotide; and (ii) residues involved in van der Waals, hydrogen-bonding,
hydrophobic or electrostatic interactions with nucleotides 40]. Any amino acid residue satisfying the above definitions are regarded as RNA-binding
residue. RNA-protein complexes in the dataset are shown in Table 4.

Table 4. Protein-RNA complexes in RB344 dataset

Identification of protein surface residues

To determine protein surface residues, accessible areas will be computed first. If
its accessible area is larger than zero, the residue is considered as a surface residue.
Otherwise it is a non-surface residue. The accessible areas can be calculated using
VMD software 42] with the probe radius of 1.4 Å.

Shape descriptor for protein residues

The backbone of an amino acid is defined by four atoms: N, CA, C, and O. The center
of the side-chain is defined as the mean of coordinates of all heavy atoms on the
side-chain. However, the side-chain of glycine only has a hydrogen substituent. Therefore,
the hydrogen is used as the side-chain center of glycine. An amino acid is represented
by atoms from its backbone and the point of its side-chain center.

Template patches construction

The neighbors of a protein residue are defined as amino acids which are located within
a certain distance (d N
) on the protein surface. The distance between two amino acids is the smallest distance
between their atoms. A patch is called 2-aa/3-aa patch if it is composed by two/three
residues, respectively. Suppose a residue has a set of neighbors {n1
, n2
, …, n k
}. When k???2, two neighbors can be selected from the neighbor set and construct a 3-aa patch
with the residue. The total number of 3-aa patches constructed from the neighbor set
is C2 k. When k?=?1, a 2-aa patch can be constructed which is composed by the residue and its neighbor.
The situation of k?=?0 is not considered currently because we assume that interaction interfaces are
areas consisting of two or more residues. If a surface residue in the training set
is known to interact with RNAs, the patches constructed from its neighbors are regarded
as positive patches. Only positive patches are used as template patches. We obtained
175,989 3-aa template patches and 122 2-aa template patches from RB344 dataset when
d N
equals 3 Å.

Structural similarity between patches

Because each amino acid can be represented as a set of atoms from the backbone and
the center of its side chain, a 3-aa/2-aa patch can be represented as the assembly
of representative points from all its member residues. When comparing shapes of two
surface patches of the same size, i.e. both of them are 3-aa or 2-aa patches, they
are treated as rigid objects. The structural similarity between two patches can be
measured by the sum of Euclidean distances between the corresponding points after
rotation and translation (i.e. the least-squares distance between two sets of points).
Suppose a patch contains m (m?{2,3}) residues, each of which is composed by n points, the least-squares distance between patch X and Y can be computed using Eq. (1).

(1)

where s is a scale factor, R is a rotation matrix, and t is a translation vector. x ij
and y ij
are the j-th point from i-th residue of X and Y respectively. The optimal solution of s, R, and t for Eq. (1) is:

(2)

In Eq. (2), and are the centroids of X and Y. Matrices U and V are obtained by singular value decomposition: Y‘X‘
T
?=?U?V T
, where X‘?=?{x‘
ij
}
i=1,…,m;j=1,…,n
and Y‘?=?{y‘
ij
}
i=1,…,m;j=1,…,n
are obtained by subtracting and from the points, i.e. and ; i?=?1,…,m and j?=?1,…,n. Details of this optimal solution can be found in 43]. In our problem, we assume that there is no scale change between two similar patches.
Therefore, the scale factor s is set to 1 instead of using the value in Eq. (2).

To compute the least-squares distance, the correspondence between points from two
objects should be known in advance. However, the correspondence between the residues
from two patches has not been determined yet. Therefore, orders of residues in patch
X are permuted to create different correspondences to residues in patch Y. Once the correspondence between residues from two patches has been determined, the
correspondence between their representative points will be automatically determined.
To compare two 3-aa patches, there are 6 ways (P3 3) of correspondences and to compare two 2-aa patches, there are 2 ways (P2 2) of correspondences. We compute the least-squares distances between two patches
using different correspondences and the minimum one is defined as the structural similarity
(d SS
) between two patches. d SS
can be computed using Eq. (3) according to its definition.

(3)

In Eq. (3), X(i)
is the i-th way of reordering residues in patch X and there are Pm m ways (m is the number of residues in a patch) of residue reordering in all.

Clustering template patches

It’s difficult to use all template patches to construct structural features. Therefore,
we select some representative ones from them so that the dimension of structural features
can be acceptable. We group the extracted 3-aa and 2-aa template patches using complete-linkage
hierarchical clustering. The distance metric used in clustering algorithm is the least-squares
distance shown in Eq. (3). The cluster dendrograms of 3-aa patches and 2-aa patches extracted from the training
set in a fold of cross validation are shown in Fig. 3. Hierarchical clustering using single- and average-linkage is also performed. However,
the resulting dendrograms have ladder shapes. It indicates that these two methods
are not suitable for clustering template patches. The final clusters represent distinct
structural patterns of template patches. They are more or less similar to protein
structural motifs but are much smaller. They can be regarded as binding units of interaction
interfaces of proteins and are used to describe RNA-binding surfaces.

Fig. 3. Hierarchical clustering on 3-aa and 2-aa template patches. Hierarchical clustering
with complete-linkage on (a) 3-aa and (b) 2-aa template patches

In each fold of cross validation, there are ~130,000 3-aa template patches constructed
from the training set. We randomly selected 10,000 3-aa template patches and performed
hierarchical clustering. The selected 3-aa patches and all 2-aa patches are further
grouped into 40 and 20 clusters. In each cluster, the centroid patch, which has the
smallest sum-of-square distance to other members, is also determined. The centroid
patches are regarded as the representative patches.

Patches in each cluster reveal distinct structural patterns. For example, in one cluster,
three amino acids are arranged in a linear way (see Fig. 4(a)). While in another, they are placed like the head of a fork (see Fig. 4(b)). The sequences of patches in each cluster are not conserved. However, their structures
are quite similar. It indicates that template patches have specific structural patterns
and RNAs may have structure preference when binding with proteins.

Fig. 4. Structures from two clusters of 3-aa template patches. (a) Four template patches in Cluster 1; (b) four template patches in Cluster 4

The new structural features construction

Given a residue r, structural features can be constructed in the following way. Firstly, all neighbours
of r located within the distance of d N
on the protein surface are identified. Suppose there are k neighbouring residues and they are denoted by {n1
, n2
,…, n k
}. A set of patches {X1
,X2
,…,X K
} can be constructed using {n1
, n2
,…, n k
} and r: if k?=?1, i.e. r has only one neighbour n1
, a 2-aa patch {X1
} which is simply composed of r and n1
can be constructed. If k???2, several 3-aa patches which are composed of r and two of its neighbours can be constructed (the total number of 3-aa patches, K, equals to ). To construct structural features for residue r, {X1
,X2
,…,X K
} will be compared with each representative patch and accumulated distances to each
representative patch will be computed.

Denote the set of representative patches by {Y1
,Y2
,…,Y L
}. The length of the new feature vector L is the total number of 3-aa and 2-aa representative patches. In our work, L is 60 (because there are 40 3-aa representative patches and 20 2-aa representative
patches). Rearrange representative patches and let {Y1
,Y2
,…,Y40
} be 3-aa representative patches and {Y41
,Y42
,…,Y60
} be 2-aa representative patches.

If k?=?1, there will be only one 2-aa patch {X1
} surrounding r. X1
will be compared with each 2-aa representative patch. Suppose f j
is the distance of X1
to Y j
(40??j???L). Then

(4)

Because X1
only contains two residues, it cannot be compared with 3-aa representative patches.
The distances between X1
and 3-aa representative patches are directly set zeros.

If k???2, each 3-aa patch X i
(i?=?1,…,K) is compared with Y j
(1???j???40). f j
can be computed using Eq. (5).

(5)

f j
(j?=?1,…,40) is the accumulated distance of surrounding patches {X1
, X2
,…, X K
} to the representative patch Y j
. When 40??j???60, f j
is set zeroes because X i
is a 3-aa patch which cannot be compared with 2-aa representative patches. In the
end, a 60-dimension feature vector [f1
,…,f L
] can be constructed for the residue r.

The rationale of comparing {X1
,X2
,…, X K
} with representative patches is as follows. The protein surface around a binding
residue can be characterized by template patches. After clustering, template patches
can be approximated by representative patches. Therefore, we can describe the protein
surface surrounding a binding residue using the combination of representative patches.
The problem is how to quantitatively measure the structural similarity of surfaces
surrounding two residues. Considering that surfaces can be approximated by combinations
of representative patches, we compute the accumulated distance of surrounding patches
to each representative patch and denote it as a structural feature. If there are L representative patches, L features will be obtained. These structural features contain potential structural
information. It can be seen that, for all residues, no matter RNA-binding or non-RNA-binding,
their structural features can be constructed by computing the accumulated distance
of surrounding patches to representative patches. Given a target residue r, if its surrounding surface is similar to the surfaces surrounding RNA-binding residues,
its structural features will be more close to features of RNA-binding residues. Based
on its structural features, r can be classified as an RNA-binding residue or a non-RNA-binding residue.

Other features used for RNA-binding residue prediction

In addition to the proposed structural features, other sequence features of amino
acids are also introduced to describe RNA-binding property. Each residue in RNA-binding
proteins is characterized by another two descriptors including: (i) PSSM which gives
values of sequence conservation for residues using PSI-BLAST 44]; (ii) the residue interface propensity which describes the frequency of different
types of amino acids occurring in the interaction interface than on the protein surface
23].

For residues from the training set and the test set, we can construct feature vectors
which combine the new structural features and two additional sequence features. The
dimension of all features is 81.

RNA-binding residue prediction using ensemble method

RNA-binding residue prediction can be regarded as a classification problem when feature
vectors have been presented. In the learning process, a classification model can be
learned using feature vectors and class labels of residues from the training set.
Then, the classification model can be applied to predict binding propensities for
residues in the test set. Compared with individual classifiers, ensemble classifiers
have already been shown to produce better classification results 45], 46]. Specifically, in the problem of RNA-binding residue prediction, random forest, an
ensemble classifier, has already been adopted and showed a high performance.

In our method, ensemble learning technique is also used. ENTOOL 47] is a package which integrates a series of classification algorithms, which include
SVM, decision tree, ridge regression, Gaussian mixture models, multilayer perceptron,
etc. In our work, models of ridge regression, perceptron, and multilayer perceptron
are selected as constituent classifiers because they can achieve better performances
than other classifiers in ENTOOL.

Methods for prediction performance evaluation

ENTOOL first performs five-fold cross-validation on the training residues to adjust
parameters of the ensemble classifier and then predicts binding scores for target
residues using the trained models. The predicted binding scores vary from ?1 to 1.
The larger the binding score, the higher binding propensity of the target residue.

By comparing the predicted scores with the true labels of those residues in the test
set, four metrics can be computed: true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN). Based on the four metrics, false positive
rate (FPR) and true positive rate (TPR, which is also called sensitivity) can be computed
(see Eq. (6)). ROC curve can be created by plotting FPR values against TPR values. Other performance
metrics, such as AUC, accuracy, precision, specificity, F-score, and MCC can also
be computed (see Eq. (6)).

(6)