Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters

Protein sequence database search

A performance measure commonly used in computational biology to evaluate Smith-Waterman implementations is cell updates per second (CUPS). One cell update corresponds to the complete computation of one entry of the DP matrix, including all comparisons, additions, and maxima operations; results below are reported in GCUPS (billions of cell updates per second).
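To make the metric concrete, the throughput of a scan can be computed as in the short sketch below. This is our own illustration of the definition; the function and variable names are not taken from the evaluated tools.

    #include <cstdint>

    // GCUPS = (query length x number of database residues) / (runtime x 10^9),
    // i.e. billions of DP-matrix cell updates performed per second.
    inline double gcups(std::uint64_t query_len, std::uint64_t db_residues,
                        double runtime_seconds) {
        double cells = static_cast<double>(query_len) *
                       static_cast<double>(db_residues);
        return cells / (runtime_seconds * 1e9);
    }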

We have scanned three protein sequence databases: (i) the 7.5 GB UniProtKB/Reviewed and Annotated (5,943,361,275 residues in 16,110,751 sequences), (ii) the 18 GB UniProtKB/TrEMBL (13,630,914,768 residues in 42,821,879 sequences), and (iii) the 37 GB merged Non-Redundant plus UniProtKB/TrEMBL (24,323,686,690 residues in 73,401,766 sequences), using query sequences of varying lengths. The query sequences used in our tests have the accession numbers P01008, P42357, P56418, P07756, P19096, P0C6B8, P08519, and Q9UKN1.

Performance on a single node

We have first compared the single-node performance of our methods to SWAPHI [8] and CUDASW++ 3.1 [26]. SWAPHI is another parallel Smith-Waterman algorithm for Xeon Phi-based neo-heterogeneous architectures. It is also implemented using the offload model. However, SWAPHI can only run search tasks on the Xeon Phi; i.e., it does not exploit the computing power of the multi-core CPUs. Furthermore, SWAPHI cannot handle search tasks for large-scale biological databases: in our tests, we found that SWAPHI is limited to databases smaller than the available RAM size, i.e. 16 GB. CUDASW++ 3.1 is currently the fastest available Smith-Waterman implementation for database searching. It makes use of the compute power of both the CPU and the GPU. On the CPU side, CUDASW++ 3.1 carries out parallel database searching by invoking the SWIPE [18] program; on the GPU side, it employs CUDA PTX SIMD video instructions to exploit data parallelism. The database size supported by CUDASW++ 3.1 is limited by the memory available on the GPU. Neither SWAPHI nor CUDASW++ 3.1 supports clusters.
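As a point of reference for the offload model mentioned above, the fragment below sketches how a database chunk can be shipped to a coprocessor and searched there. It is a minimal illustration assuming the Intel compiler's offload pragmas; the buffer names and the kernel body are placeholders rather than code from any of the compared tools.

    // Minimal sketch of the offload programming model (requires the Intel C/C++
    // compiler with Xeon Phi offload support); the kernel body is a placeholder.
    void search_chunk_on_phi(const char* db_chunk, long chunk_bytes,
                             const char* query, int query_len,
                             int* scores, int num_seqs) {
        // Ship the database chunk and the query to coprocessor 0, run the search
        // there with up to 240 threads, and copy the per-sequence scores back.
        #pragma offload target(mic:0) \
            in(db_chunk : length(chunk_bytes)) \
            in(query    : length(query_len))   \
            out(scores  : length(num_seqs))
        {
            #pragma omp parallel for num_threads(240)
            for (int i = 0; i < num_seqs; ++i) {
                scores[i] = 0; // placeholder for the per-sequence alignment kernel
            }
        }
    }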

For single-node tests, we have used the N2 node (see Table 1) as the test platform. In our experiments, we run our methods with 24 threads on the two Intel E5-2620 v2 six-core 2.0 GHz CPUs and with 240 threads on each Intel Xeon Phi 7110P. We execute SWAPHI with 240 threads on each Xeon Phi 7110P. We have executed CUDASW++ 3.1 on another server with the same two Intel E5-2620 v2 six-core 2.0 GHz CPUs plus two Nvidia Tesla Kepler K40 GPUs with ECC enabled; 24 CPU threads are also used for CUDASW++ 3.1. Unless specified otherwise, default parameters are used for both SWAPHI and CUDASW++ 3.1, and all available compiler optimizations have been enabled. The parameters α = 10 (gap opening penalty) and β = 2 (gap extension penalty) have been used in our experiments. The substitution matrix used is BLOSUM62.
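For reference, these parameters enter the affine-gap Smith-Waterman recurrences. The formulation below is one common convention (our notation, which may differ in minor details from the exact convention used in the compared implementations), with sbt denoting the BLOSUM62 substitution score:

    \[
    \begin{aligned}
    E_{i,j} &= \max\bigl(E_{i,j-1} - \beta,\; H_{i,j-1} - \alpha\bigr),\\
    F_{i,j} &= \max\bigl(F_{i-1,j} - \beta,\; H_{i-1,j} - \alpha\bigr),\\
    H_{i,j} &= \max\bigl(0,\; E_{i,j},\; F_{i,j},\; H_{i-1,j-1} + \mathrm{sbt}(q_i, d_j)\bigr).
    \end{aligned}
    \]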

We have measured the time to compute the similarity matrices to calculate the computing CUPS values in our experiments. Figure 10a shows the corresponding computing GCUPS values of our methods, SWAPHI, and CUDASW++ 3.1 for searching the 7.5 GB UniProtKB/Reviewed and Annotated protein database using different query sequences. From Fig. 10a we can see that the computing GCUPS of our multi-pass method is comparable to that of CUDASW++ 3.1; both achieve better performance than SWAPHI.

Fig. 10 a Performance comparison on a single node (N2) between our method, CUDASW++ 3.1, and SWAPHI. b Performance results of our method using all three compute nodes

SWAPHI and CUDASW++ 3.1 cannot support search tasks for the 18 GB and 37 GB databases; thus, we only use our methods to search them. Figure 10a also reports the performance of our methods for searching these two databases. The results show that our methods can handle large-scale database search tasks efficiently.
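The reason device memory is not a hard limit for a multi-pass scan can be illustrated with a simple chunked loop, sketched below. This is a generic sketch under our own assumptions (chunk sizing, a hypothetical process_chunk kernel, no overlap of transfer and compute), not the authors' exact scheme.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Generic multi-pass database scan: split a database that exceeds device
    // memory into chunks that fit, and process the chunks one pass at a time.
    // process_chunk() stands for an offloaded alignment kernel (hypothetical).
    void multi_pass_search(const std::vector<char>& database,
                           std::size_t device_capacity,
                           void (*process_chunk)(const char*, std::size_t)) {
        std::size_t offset = 0;
        while (offset < database.size()) {
            std::size_t chunk = std::min(device_capacity,
                                         database.size() - offset);
            // A real implementation would cut at sequence boundaries and
            // overlap data transfers with computation on the coprocessor.
            process_chunk(database.data() + offset, chunk);
            offset += chunk;
        }
    }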

Performance on a cluster

Figure 10b shows the performance of our methods using all three cluster nodes. The results indicate that our methods exhibit good scalability in terms of sequence length, database size, and number of compute nodes. Our method achieves a peak overall performance of 730 GCUPS on the Xeon Phi-based cluster.