PredSTP: a highly accurate SVM based model to predict sequential cystine stabilized peptides

Known STP sequence collection

Sequence of ICKs and cyclotides (knotted STPs) were collected from the Knottin database
(http://knottin.cbs.cnrs.fr/) and 167 sequences with solved 3D structures were obtained from this source. An additional
36 sequences of nonknotted STPs with known 3D structures were collected from PDB with
90 % sequence identity (http://www.rcsb.org/, June, 2013). Our total set of 204 candidate sequences (167 from the knottin database
and 37 from PDB) were further reduced to remove redundant sequences, defined as sequences
sharing???90 % sequence identity using CD-HIT 54], 55]. A total of 108 sequences were retained from the knottin database set and 36 sequences
were from the PDB set, leaving 144 canonical STPs (Additional file 1: Supplement 1). The mean, standard deviation and range of the number of residues
in the positive training set are 42.20, 15.70 and 23–143, respectively, with an average
number of 6 cysteines per chain.

Control negative sequence collection

Sequences classified as negative control were collected from PDB using a criterion
that was species agnostic and stipulated the exclusion of STPs through positive matches
to PDB small proteins (Additional file 1: Supplement 2). 393 sequences were classified as non-STP sequences for the purposes
of this study. The mean, standard deviation and range of the number of residues in
the chains of the negative training set are 63.16, 25.92 and 9–160, respectively,
with an average number of 6 cysteines per chain.

Independent test sequence collection

Seven independent sets of sequences were collected to verify the robustness of the
model (Table 1). Among these were sets classified according to Protein Data Bank (PDB, July 2013)
criteria as Eukaryote, Bacteria, Archaea, Virus and Unassigned. In addition, a set
of proteins whose sequences were recently solved by NMR and deposited in PDB (July
04, 2012 to March 25, 2014) (NewNMR751) and also the Structural Classification of
Protein (SCOP) PDB subset were used (Smallprotein163). Small protein sequences were
retrieved with the following parameters: (a) resolution??1.5 Ã…, (b) protein chain
but not DNA/RNA/Hybrid, and (c) limited to small disulfide rich proteins and have
similarity in size, number of disulfide bonds, cystine number and cystine arrangements
in their primary structure. The result included STPs, rubredoxins, BPTI-like, snake
toxin-like, crambin-like, insulin-like, and high potential iron proteins among others.

Table 1. Description of independent test sets analyzed by the new model (PredSTP)

Defining the putative STP cystine motif

STP motifs consist of six cysteine residues (C1–C6) flanked by varying number of non-cysteine
residues (Fig. 1). This set of consecutive cysteines is identified here by elucidating the distance
between each consecutive pair of cysteines, i and i?+?1 as ?C i,i+1
(cysteine loops). Based on our global analysis of STP motifs, if the min(?C i,i+1
) is greater than three, then the motif is not considered to contain a STP and is
discarded (Additional file 1: Figure S1). Likewise, if the min(?C i,i+1
) is less than or equal to three and located between C1 and C2 or C2 and C3 the motifs
are disregarded as these motifs are often found within electron transport-like proteins
such as ferredoxin, rubredoxin, and iron-sulfur proteins 56], 57]. Otherwise, the min(?C i,i+1
) was defined to exist between cysteines C3 and C4. This default pair of cysteines
is shifted to a higher pair of cysteines if there exist less than 2 additional c-terminus
cysteines. For example, if after the default C3 and C4 cysteines are identified, there
is only one c-terminus cysteine, then the min(?C i,i+1
) is defined as cysteines C4 and C5.

Proximity Length (P) and Normalized Proximity Length (NP)

After putative STP motifs are identified, a set of three proximity lengths are calculated:
P 1
?=??C 1,4 ; P 2
?
=??C 2,5 ; P 3
?
=??C 3,6
. Motifs of less than six cysteines, or motifs defined as invalid by our criteria,
were assigned P 1
?=?P 2
?
=?P 3
?
=?0. A Normalized Proximity Length (NP) was then assigned for each proximity length,
P, resulting in three new values: NP 1 , NP 2 , and NP 3
. The NP identifies the distance from the observed mean proximity lengths of known
STPs to the corresponding bonded cysteines involved in STP cysteine loops in the training
set. For example, the average P for all STP sequences in the training set is subtracted from the calculated P value associated with its corresponding proximity length and normalized as described
in Eq.1, where is the average of the proximity lengths of known STPs derived from the training set.

(1)

Detecting least loop length ratio

The least loop length is defined as the min(?C i,i+1
) divided by the total length of the peptide. This feature is used as part of feature
sets 5 and 6, see Additional file 1: Supplement 3.

Detecting presence of amino acid between C4–C5 and C5–C6

Data published describing loop lengths of ICKs and cyclotides, which comprise a large
subset of STPs 21], motivated a Boolean feature for the presence of inter-loop amino acids. A result
of ‘true’ is returned if there is a presence of a minimum of one amino acid in both
of the last two loops (C4–C5 and C5–C6) in a putative STP motif.

Algorithm

We used a Support Vector Machine (SVM) classifier/predictor implementation to elucidate
STP toxins. The SVM was implemented using the e1071 library in R (2.15.1). Feature
sets were assigned as described in the Additional file 1: Supplement 3, and sensitivity, specificity, precision and accuracy were determined after ten-fold
cross validation. Initial gamma and cost were set to 0.1 and 0.1, respectively, with
the best output at 0.0587. Given 144 STP and 393 non-STP chains, 100 and 300 random
samples were chosen, respectively, for a training set over 200 iterations. Feature
sets were prioritized based on accuracy.

STP sequences were predicted from the test sets described previously (Table 1) using feature set 6. Due to the limited throughput of the Knoter1D interface, only
the “NewNMR751” and “Smallprotein163” (predicted STP chains from the SCOPs derived
subset) predictions where compared against Knoter 1D predictions (http://knottin.cbs.cnrs.fr/Tools_1D.php) and validated with Jmol by analyzing the disulfide connectivity using the corresponding
PDB files. Results from only the eukaryotic test sets were filtered to remove sequences
with???30 % chain identity and compared against Jmol analysis. Chains exhibiting canonical
STP connectivity (C1–C4, C2–C5, C3–C6) were initially considered as true positives.
True positives were further cross matched with their PDB annotations to make the final
confirmation.

Confusion matrix creation

A confusion matrix was created to perform the cross validation test. True Positives
(TP), False Positives (FP), True Negatives (TN) and False Negatives (FN) were determined
from the confusion matrix. Sensitivity [TP/(TP?+?FN)], specificity [TN/(TN?+?FP)],
precision [TP/(TP?+?FP)], accuracy [(TP?+?TN)/(TP?+?FN?+?TN?+?FP)] and Mathews Correlation
Coefficient (MCC) [(TPXTN-FPXFN)/sqrt{(TP?+?FP)(TP?+?FN)(TN?+?FP)(TN?+?FN) were calculated
to evaluate the performance of the algorithm.

PSI BLAST

The BLAST suite (blast-2.2.29+) was installed on a local machine along with the appropriate
dataset. The dataset was the chains of proteins deposited in PDB, solved by the NMR
method, from July 04, 2012 to March 25, 2014. The selected threshold e-values PSI
BLAST 58] were 0.01, 0.1 and 0.5. The number of iterations for PSI BLAST was 5. All other parameters
were set as default.