ConEVA: a toolbox for comprehensive assessment of protein contacts

The recent success of many protein residue contact prediction methods has kindled new hope of solving the long-standing problem of ab initio protein structure prediction [1–6]. Consequently, contact-guided ab initio structure prediction has emerged as an important field. When accurately predicted contacts are supplied as input to structure prediction or reconstruction methods, accurate folds can be predicted consistently [1, 7–9]. In general, accurate contacts lead to accurate structural models. However, for sequences that have no homologous templates (hard sequences), the optimal way of using predicted contacts to predict folds is still a subject of ongoing research. For instance, experiments on reconstruction from true contacts have suggested that a distance threshold of 9 Å or more delivers the best reconstruction with Cβ atoms [10, 11], but the 8 Å threshold defined by the Critical Assessment of Protein Structure Prediction (CASP) is still widely used to predict contacts [13, 6, 12]. Marks et al. have even demonstrated successful structure predictions using Cα atoms and a 7 Å threshold for defining contacts [13]. Similarly, it is widely accepted that long-range contacts [12, 14, 15] are the most useful of the three contact types (short-, medium-, and long-range), but some structural domains introduced in CASP, such as T0765-D1, T0709-D1, T0711-D1, T0756-D2 and T0700-D1, have very few or no long-range contacts at all. In addition, Michel et al. discuss examples of proteins that their PconsFold method could not reconstruct accurately despite the high accuracy of the predicted contacts [16]. Using the protein 1JWQ, Vassura et al. show that some structures cannot be folded with distance thresholds below 16 Å [10]. Zhang et al. report folding 90 transmembrane proteins at a 14 Å cut-off [17]. Furthermore, these works do not agree on the optimal number (or range) of contacts needed for accurate reconstruction.
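
To make the contact parameters discussed above concrete, the following minimal sketch derives contacts from one representative atom per residue under a chosen distance threshold and labels each contact by sequence separation. The function names, the toy hairpin coordinates, and the separation bounds (6 to 11 for short-range, 12 to 23 for medium-range, 24 or more for long-range, a convention commonly used in CASP assessments) are assumptions made for illustration and are not taken from ConEVA itself.

```python
import math

def contacts_from_coords(coords, threshold=8.0, min_sep=6):
    """Return 1-based residue pairs (i, j) closer than `threshold` (in Angstrom).

    `coords` holds one (x, y, z) tuple per residue, e.g. its C-beta atom.
    Pairs separated by fewer than `min_sep` residues are skipped.
    """
    pairs = []
    n = len(coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) < threshold:
                pairs.append((i + 1, j + 1))
    return pairs

def contact_range(i, j):
    """Label a contact by sequence separation (assumed CASP-style bounds)."""
    sep = abs(i - j)
    if sep >= 24:
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= 6:
        return "short"
    return "local"

# Hypothetical toy structure: a flat 30-residue hairpin with two
# antiparallel strands 5 Angstrom apart and 3.8 Angstrom between neighbours.
toy_coords = [(3.8 * k, 0.0, 0.0) for k in range(15)]
toy_coords += [(3.8 * (29 - k), 5.0, 0.0) for k in range(15, 30)]

for cutoff in (7.0, 8.0, 9.0):
    found = contacts_from_coords(toy_coords, threshold=cutoff)
    n_long = sum(1 for i, j in found if contact_range(i, j) == "long")
    print(f"{cutoff:.0f} A: {len(found)} contacts, {n_long} long-range")
```

Varying `threshold` and `min_sep` on the same coordinates shows how the choice of contact definition changes both the number of contacts and their distribution across the three ranges.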

Hence, a tool for studying the relationship between contact parameters and structure types is needed. Currently, the three most widely used measures for evaluating predicted contacts are precision, coverage and distance distribution score (Xd) [3, 12, 14, 18–22]. In addition, other measures such as ‘mean false positive error’, ‘distance in contact map’ or ‘spread’ [13], F-score and Matthews correlation coefficient (MCC) [12] are also used for a more rigorous evaluation of predicted contacts. Osvaldo et al. [23] published EVAcon in 2005, which could calculate some of these measures, but it no longer seems to be accessible. Existing tools such as CMView [24] and CoeViz [25], on the other hand, only enable contact map visualization and multiple sequence visualization.
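
As a rough illustration of some of the measures named above, the sketch below computes precision, coverage, F-score and MCC for a ranked list of predicted contacts against a set of true contacts. The function name, its arguments, and the toy data are hypothetical. The formulas are the standard textbook definitions of these measures, and the sketch is not ConEVA's implementation. Xd is left out because it additionally requires the true inter-residue distances.

```python
import math

def evaluate_contacts(predicted, true_contacts, n_pairs, top_n=None):
    """Score predicted contacts against true contacts.

    predicted     : list of (i, j) pairs sorted by decreasing confidence
    true_contacts : set of (i, j) pairs observed in the native structure
    n_pairs       : total number of residue pairs considered (for true negatives)
    top_n         : evaluate only the top_n predictions (all if None)
    """
    selected = set(predicted if top_n is None else predicted[:top_n])
    tp = len(selected & true_contacts)
    fp = len(selected) - tp
    fn = len(true_contacts) - tp
    tn = n_pairs - tp - fp - fn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    coverage = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + coverage > 0:
        f_score = 2 * precision * coverage / (precision + coverage)
    else:
        f_score = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "coverage": coverage,
            "f_score": f_score, "mcc": mcc}

# Hypothetical toy data: four ranked predictions, four true contacts.
toy_predicted = [(1, 30), (2, 29), (3, 20), (5, 28)]
toy_true = {(1, 30), (2, 29), (5, 28), (6, 27)}

all_scores = evaluate_contacts(toy_predicted, toy_true, n_pairs=100)
top2_scores = evaluate_contacts(toy_predicted, toy_true, n_pairs=100, top_n=2)
print(all_scores)
print(top2_scores)
```

Restricting the evaluation with `top_n` shows the usual trade-off: fewer, more confident predictions tend to raise precision while lowering coverage.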

In this paper, we present ConEVA, a fast web application (along with a downloadable tool) for protein contact evaluation and comparison. In addition to describing the server, we report observations obtained by applying our tool to larger data sets. We discuss how the length of a protein can influence various evaluation measures, the minimum number of contacts to evaluate, and the range of evaluation measure values associated with determining the correct fold of a protein.