HPG pore: an efficient and scalable framework for nanopore sequencing data


Features

HPG Pore has a number of features in common with the poRe and Poretools programs,
but also implements several useful unique features related to quality control and
other parameters of the sequence obtained, such as mean read quality, %GC, as well
as plots per base sequence content and read quality histograms, among others. Some
of the features that differentiate the programs originate in the different ways in
which data files are managed. For instance, poRe produces one individual file for
each sequence in the HDF5 file, which can cause problems with file system quotas
if a large number of reads are present in the HDF5 file. In contrast, HPG Pore produces
three files containing the three types of reads (template, complement and 2D), which
is more convenient for further mapping with other software. Table 2 summarizes the
HPG Pore features and compares them to those implemented in poRe and Poretools.
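
As an illustration of the per-read quality-control statistics mentioned above (%GC and mean read quality), the following minimal Python sketch computes both from FastQ records. It is not the HPG Pore implementation: the function names are hypothetical, the quality strings are assumed to use Phred+33 encoding, the mean read quality is assumed to be the arithmetic mean of the per-base Phred scores, and records are assumed to follow the simple four-line FastQ layout.

```python
from statistics import mean


def gc_percent(sequence: str) -> float:
    """Return the percentage of G and C bases in a read (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def mean_read_quality(quality: str, offset: int = 33) -> float:
    """Return the arithmetic mean of the Phred scores in a FastQ quality string.

    Assumption: Phred+33 encoding unless another offset is passed.
    """
    if not quality:
        return 0.0
    return float(mean(ord(ch) - offset for ch in quality))


def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from an iterable of FastQ lines.

    Assumption: four lines per record, with no wrapped sequence lines.
    """
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        sequence = next(it).strip()
        next(it)  # the '+' separator line
        quality = next(it).strip()
        yield header.strip()[1:], sequence, quality


def summarize(lines):
    """Return a list of (read_id, %GC, mean quality) for every record."""
    return [
        (read_id, gc_percent(seq), mean_read_quality(qual))
        for read_id, seq, qual in parse_fastq(lines)
    ]
```

For example, `summarize(["@read_1", "GGCCAATT", "+", "IIII5555"])` returns one tuple for `read_1` with a %GC of 50.0 and a mean quality of 30.0. The same functions could be applied separately to each of the three output files (template, complement and 2D).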

Table 2. Comparison of HPG Pore to the other tools available

Like poRe and Poretools, HPG Pore produces FastQ files that can be used for downstream
analysis with any conventional tool for read mapping and further variation analysis (point
mutations [12], [13] or structural variants [14]), genome assembly [13], etc. More recent
programs, such as NanoOK [18], provide built-in downstream analysis with an environment
in which alignment can be carried out and different statistics can be obtained. However,
the greatest benefit would be obtained in a near-future scenario in which downstream
analysis tools can natively run in the Hadoop environment. In order to avoid the transfer
of HDF5 and FastQ files to a local file system, we are currently implementing read mappers,
such as HPG Aligner [19], in Hadoop clusters.

Runtimes and scalability

Since the different programs calculate different statistics, running times have been
measured for the generation of FastQ files from the original HDF5 files. The programs
were run in a Hadoop cluster with 8 nodes with 16 cores each (Intel Xeon CPU E5-2667
v2 @ 3.30GHz), 64 GB of RAM, and 12 TB of storage distributed across 24 disks of 500 GB.
Our study shows that runtimes in poRe, Poretools
and HPG Pore (running locally) are approximately linearly dependent on the number
of sequences in the FAST5 file, with a trend towards an increased slope for high numbers
of sequences. HPG Pore runs the fastest, followed by Poretools, while poRe presents
remarkably slower execution times (see Fig. 2). A specific problem with poRe is that
the large number of sequence files that it produces causes disk quota exceeded errors.
To run the program with a high number of reads, the quota must be explicitly raised
in the file system.

Fig. 2. Runtimes of the three programs, poRe, Poretools, and HPG Pore, as a function of the
number of sequences in the FAST5 file

When HPG Pore runs in Hadoop mode, it is faster than Poretools and poRe, despite an
initial delay due to the preparation of the Hadoop nodes and, as expected, it is
even faster when more nodes are available. It thus outperforms the other two programs
running in local mode (see Fig. 2). The latency of the Hadoop framework (see https://goo.gl/ujNR9F)
causes the paradox that the stand-alone version of HPG Pore is slightly slower
than its Hadoop counterpart running on one node.

Since reads are randomly distributed across nodes in the Hadoop environment, we do
not expect parameters such as read length to have any specific effect on runtimes or
performance.

The Hadoop environment allows storage as well as speed to be scaled up. Figure 3 (upper
panel) shows how runtimes decrease as the number of nodes available in the
cluster increases in four different scenarios: with 32,000, 100,000, 300,000 and 1
million sequences in the FAST5 file. The speed-ups are always above the ideal expected
acceleration (dotted line), and the increase in speed is clearly higher for larger
data sizes (Fig. 3, lower panel).

Fig. 3. Runtimes (upper panel) and increase in speed (lower panel) as the number of nodes increases in the Hadoop system in four different scenarios:
FAST5 file containing 32,000 (blue line), 100,000 (red line), 300,000 (green line) and 1 million (dark blue line) sequences. The dotted line in the lower panel represents the ideal speed-up according
to the number of nodes used. Speed-ups have been calculated using 3 nodes as the starting
point, given that the 1 million reads could not be processed with only one node.
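
The speed-up calculation described in the caption can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function name is hypothetical, and we assume that the ideal speed-up relative to the 3-node baseline is simply the node count divided by 3. Runtimes are passed in by the caller; any numbers used to try the function are placeholders, not measurements from this study.

```python
def relative_speedups(runtimes, baseline_nodes=3):
    """Map node count -> (observed speed-up, ideal speed-up) relative to a baseline.

    runtimes: dict of node count -> wall-clock runtime, in any consistent unit.
    baseline_nodes: node count used as the starting point (3, as in Fig. 3).
    Assumption: ideal speed-up is linear, i.e. nodes / baseline_nodes.
    """
    if baseline_nodes not in runtimes:
        raise ValueError("runtimes must include the baseline node count")
    baseline_runtime = runtimes[baseline_nodes]
    result = {}
    for nodes in sorted(runtimes):
        observed = baseline_runtime / runtimes[nodes]
        ideal = nodes / baseline_nodes
        result[nodes] = (observed, ideal)
    return result
```

With placeholder runtimes of 120 time units on 3 nodes and 48 on 6 nodes, `relative_speedups({3: 120.0, 6: 48.0})` gives an observed speed-up of 2.5 on 6 nodes against an ideal of 2.0, which corresponds to a point lying above the dotted line in the lower panel.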