Structural alignment of protein descriptors – a combinatorial model

Algorithm 1 is a straightforward implementation of the combinatorial approach reducing the biological problem of the structural alignment of descriptors to the maximum-size assignment problem. It is a simplification of the real-world case, but resulting in quite satisfying accuracy, when we take into account its low computational complexity and only one alignment produced before the stage of final verification of its feasibility. In comparison to the exact, exponential-time backtracking algorithm it hits 81.9 % of the descriptors classified as structurally similar (for f=2.33), where the value is computed as the average for all mean coverages obtained for considered sets of descriptors of particular sizes.

Algorithm 2 differs from Algorithm 1 only in the detail that it produces a few potential alignments, one for each acceptable size K from a predefined range. The range is very narrow because for the smallest descriptors (sizes 3 and 4) only one value of parameter K, and for the biggest ones (sizes 10 and 11) three values of parameter K are considered. Thus, we obtain only one, two, or three potential alignments, later selected according to their feasibility. Such a little modification has allowed for a significant improvement of results, because the set of descriptor pairs qualified as structurally similar by the algorithm has grown, on average, to 91.4 % (for f=2.33).

The third algorithm built upon the proposed combinatorial model is the most enhanced and quality of its results is the highest. The percentage of hits to the set of optimal solutions of similar descriptor pairs, computed as above, is 93.1 % (for f=2.33). It is also better than previously mentioned algorithms for lower values of parameter f, and the difference in the mean coverage is even better noticeable.

The lower the value of f, the more restrictive selection, and for a certain K the higher probability that the solution will not be found. Therefore, a given pair of compared descriptors has a greater chance to be not classified as structurally similar. However, there is a need to set a low value of f in such cases, when one wants to tighten criteria and approve descriptor pairs, whose structural alignment is better than usual. The sense of a lower value of f is better noticeable for descriptors of a greater size. Thanks to a more stringent limit, greater alignments will be refused in favor of smaller ones, but the latter more likely fulfill the bound for the global RMSD. Unfortunately, the smallest descriptor sizes 3 and 4 do not allow for decreasing K and for them a lower value of f is rather not helpful (see detailed results in Additional file 1).

Algorithm 4 copes perfectly with sizes 3 and 4, because it explores the entire solution space and does not confine to one likely alignment. Algorithms 1–3 produce for such descriptors only one solution, which is later verified and, unfortunately often, refused. However, both kinds of the proposed approaches, the exact and polynomial-time ones, complement each other and are convenient to be applied for different ranges of sizes.

Processing efficiency of Algorithm 4 is quite satisfying, when we take into account its computational complexity. For the greatest descriptors (of size 11), it works ca. 424 s per positively classified instance, on average. However, the total time necessary for processing the entire descriptors set becomes huge. It becomes significant even for smaller sizes, for example 7, where processing all pairs within the set of cardinality 1494 takes almost two hours. When one considers analyzing all the descriptor sets at once, especially if the descriptor size can reach even 17, or protein domain libraries are broader than used here, the polynomial-time algorithms are the only option.

The main advantage of the heuristics, based on the combinatorial model, over the exact backtracking algorithm is significantly shorter processing time needed to achieve at least comparable and often indistinguishable results in terms of accuracy. These can be observed in the summary of results, obtained for top 10 of computationally expensive descriptor pairs, presented in Table St4. All considered descriptors were composed of 11 elements; therefore, they belong to the most structurally complex group, thus the processing time of the exact algorithm was significant. For 9 of 10 descriptor pairs, at least one of the proposed heuristics found the optimal solution. Variability within the algorithms’ output presented in the table is almost unnoticeable. Therefore, we present in Table St5 most often changing results, in the sense of the aligned residues ratio, for descriptors of different sizes. These results also allow for observing that the proposed heuristics are computationally efficient, especially for greater descriptors. The problem with finding a solution is observed for Hungarian method-driven algorithms if f=1.75. Detailed information about the greatest pair of descriptors from Table St5 is presented in Table St9. One can find there not only measurable features of the compared descriptors, but also their structural alignments obtained as results of Algorithms 1–4. Visualization of these structural alignments is presented in Fig. Sf7. As one can see, all the heuristic algorithms, although more or less advanced and differ in the output, produce valuable alignments very similar to the optimal solution.

In general, a protein descriptor is a short, discontinuous fragment of a polypeptide chain. Therefore, to solve the problem of structural comparison of descriptors one could use a general purpose method for solving the structural comparison of proteins. However, such a method should provide at the output an alignment constructed for a pair of protein descriptors in a form of unambiguous mapping of aligned residues rather than only a single similarity score. Moreover, a feasibility of the resultant alignment should be verified in the sense of ensuring that all the constraints specified in the problem definition of the structural comparison of descriptors resulting directly from their spatial topology are satisfied. Namely, (1) central elements are properly aligned if their residues are aligned exactly to each other (both central elements cannot be shorter and cannot be aligned to other fragments of the descriptors except themselves), (2) residues are properly aligned only when come from fully represented elements (i.e., if a particular element consists of five residues, then all of them must be included in the properly constructed alignment) and structurally similar duplexes (i.e., the RMSD value computed for them cannot be greater than 3.5 Å). Such an unambiguous mapping of aligned residues is provided by DEDAL [18], which is a web server designed for solving the protein structure alignment problem. This is a computationally efficient general purpose method that is driven by the structural comparison of protein descriptors. Therefore, we performed tests to prepare a comparison of the proposed algorithms with this tool, for our dataset of all 2022 structurally similar descriptor pairs. All alignments in which the residues included in the central elements were not properly aligned were discarded. All other improperly aligned residues were filtered out but the alignments obtained in such a way were considered for further analysis. Among all considered descriptor pairs, we found 1.93 % and 60.93 % cases in which the resultant alignment was not provided and was discarded as infeasible, respectively. For the rest, 37.14 %, the obtained alignment satisfied all the constraints specified in the problem definition of the structural comparison of descriptors. Within the latter set, for 69.24 % of pairs DEDAL found the optimal alignment, for 1.60 % of pairs the alignment length was also optimal (in the sense of aligned residues) but its RMSD value was slightly higher, and for 29.16 % of pairs the resultant alignment was shorter in comparison with the optimal counterpart. It must be stressed, however, that such a comparison provides only a rough view, because a general purpose method designed for solving the problem of the structural comparison of proteins cannot be aware of specific constraints resulting from spatial topology of descriptors that are crucial in the problem definition of the structural comparison of protein descriptors.

To demonstrate the flexibility of the tool in the identification mode, we did additional tests with changing expressions and element sizes. For two example residues, A123-VAL (d1e0ta1) and A30-PHE (d2f5ya1), structural motifs in their proximity were built with the element size equal to 3, 5, or 7. Common types of expressions for the identification of in-contact residues were chosen:

OR(DISTANCE:CBX ? 6.5, AND(DISTANCE:CBX ? DISTANCE:CA ? 0.75, DISTANCE:CBX ? 8.0))

and

OR(DISTANCE:SCGC ? 6.5, AND(DISTANCE:SCGC ? DISTANCE:CA ? 0.75, DISTANCE:SCGC ? 8.0)).

A spatial environment observed around these residues is interesting from the structural topology point of view, therefore allows for presenting diversity of 3D shapes that can be obtained through slight changes of values of input parameters. Visualization of 3D structures of the structural motifs constructed in the proximity of residue either A123-VAL (d1e0ta1) or A30-PHE (d2f5ya1) is presented in Figs. Sf4 and Sf5, respectively. Measurable features, such as numbers of segments, elements, and residues belonging to these structural motifs as well as their sequences are presented in Tables St6 and St7. As we can see, structural motifs constructed with the element size equal to 3 are generally too small to cover secondary structures. The number of segments belonging to these motifs is inversely correlated with the element size. Motifs built with the expression, where C ?
-extended point is used as a representative of the side-chain, are more general from definition, therefore more often can be identified even in nonhomologous proteins. Motifs constructed with the use of the geometrical center of side-chain are more specific, harder to identify in various protein 3D structures, and therefore they seem to be more promising to apply in a method for the single mode assessment. It should be emphasized that the tool provided here is highly configurable and allows the user to define own conditions that meet his expectations.

Descriptors for the tests were generated in the proximity of the residues analyzed previously supplemented with A2309-VAL (d2w0pa1). In Table St8, detailed information about these descriptors is presented, and visualization of their 3D structures is in Fig. Sf6.