PWHATSHAP: efficient haplotyping for future generation sequencing

In diploid individuals, such as humans, each chromosome exists in two copies, also referred to as haplotypes. One haplotype is inherited from the father, while the other is inherited from the mother. Although these two copies are highly similar, they are not identical, reflecting the genetic differences between mother and father. A Single Nucleotide Polymorphism (SNP) is a variation of a single nucleotide that occurs at a specific position, called a locus, in the pair of sequences. Given a set of heterozygous variants, i.e., loci where the two alleles differ, such as SNPs, the problem of assigning each of the two alleles at each locus to one of the two haplotypes is known as phasing.
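To make the definition concrete, the following toy Python snippet represents a phasing of four heterozygous SNPs as two haplotype strings; the positions, alleles and the chosen phase are invented purely for illustration.

```python
# A minimal, hypothetical illustration of phasing. At each heterozygous locus the
# genotype says that both alleles occur, but not on which chromosome copy;
# a phasing resolves exactly this ambiguity.

heterozygous_loci = [1042, 2310, 5177, 9020]            # toy SNP positions
genotypes = {1042: ("A", "G"), 2310: ("C", "T"),
             5177: ("G", "T"), 9020: ("A", "C")}        # unordered allele pairs

# One possible phasing: per locus, the allele assigned to haplotype 0;
# the other allele then necessarily belongs to haplotype 1.
phase = {1042: "A", 2310: "T", 5177: "G", 9020: "C"}

haplotype_0 = [phase[locus] for locus in heterozygous_loci]
haplotype_1 = []
for locus in heterozygous_loci:
    a, b = genotypes[locus]
    haplotype_1.append(a if b == phase[locus] else b)

print("H0:", "".join(haplotype_0))   # ATGC
print("H1:", "".join(haplotype_1))   # GCTA
```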

Phasing SNPs is important for many applications. Haplotype-resolved genetic data allows studying epistatic interactions, for instance. Gene regulation and epigenetics have also been demonstrated to be haplotype specific in many instances [1]. One of the prime uses of haplotype panels, i.e., large sets of haplotypes present in a population, lies in the imputation of missing variants, which is instrumental for lowering costs and boosting the power of genome-wide association studies [2]. Not surprisingly, constructing high-quality haplotype panels for human populations has been one of the central goals of several large-scale projects [3–6]. Further uses of haplotype data include studying evolutionary selection, population structure, loss of heterozygosity, and determining the parental origin of de novo mutations. Refer to [7] for a detailed review of these and other applications.

Currently, the most prevalent phasing tools use genotype information for a large number of individuals as input. Here, phase information is not observed directly, but is inferred based on the assumption that haplotype tracts are shared between individuals in a population. The resulting approaches are statistical in nature, based on, e.g., latent variable modeling [8–10] and Markov chain Monte Carlo (MCMC) techniques [11].

A major drawback of these statistical phasing methods is the lack of direct evidence that pairs of neighboring SNPs lie on the same haplotype, evidence that is ultimately needed to chain SNPs together into a pair of haplotypes. Such evidence can be provided by a sequencing read, i.e., a fragment of the actual DNA sequence: the existence of a read containing a pair of heterozygous SNPs is direct evidence that they come from the same haplotype. However, current sequencing technologies often do not provide reads long enough to link neighboring SNPs. This is why the most widely used phasing methods rely on statistical information compiled from large amounts of data about the relationships between SNPs, such as linkage disequilibrium [12], or on patterns that arise in existing haplotypes, such as the aforementioned haplotype panels [3].

Long reads are what will ultimately solve this problem, which is one of the major reasons for the recent interest in long-read technologies. While still not competitive in terms of per-base cost and error rates, and not yet sufficient to completely overcome the above drawbacks, cutting-edge technologies such as PacBio's Single Molecule Real Time (SMRT) sequencing [13] and Oxford Nanopore Technologies' MinION [14] are already on the market. This is only the beginning: these technologies will mature and improve, and others are under development. This might eventually enable routine use of haplotype-resolved sequencing in clinical diagnostics and pharmaceutical research. In the next decade, as long reads become cheap and widely available, methods that phase SNPs based on read information alone, the so-called haplotype assembly methods, will be pushed to the forefront; until now, this research area has remained mostly of theoretical interest [15–17].

Haplotype assembly methods do exactly this: they assemble haplotypes from a set of sequencing reads. If two reads overlap on a SNP position and their bases at this position differ, i.e., they are "conflicting", then one can deduce that they stem from different copies of the chromosome. The idea is to use this conflict information between pairs of reads to obtain a bipartitioning of the reads into two sets, one per haplotype. This, combined with reads that link neighboring SNPs, would give us a complete phasing of all SNPs, i.e., a set of haplotypes based on direct observation rather than only on statistical information. This is where long reads come in: they will someday provide this information, making haplotype assembly a much-needed tool for phasing.
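In the error-free case, this bipartitioning amounts to 2-colouring a "conflict graph" whose vertices are reads and whose edges connect conflicting reads. The following sketch, with invented toy reads and no relation to the actual WHATSHAP code, illustrates the idea; note that it breaks down as soon as the data contain errors, which motivates the optimisation view discussed next.

```python
from collections import deque

# Toy reads (invented): each read reports the allele (0/1) it observed
# at the SNP positions it covers.
reads = {
    "r1": {0: 0, 1: 0},
    "r2": {1: 1, 2: 1},
    "r3": {2: 0, 3: 0},
    "r4": {0: 1, 1: 1},
}

def conflict(a, b):
    """Two reads conflict if they disagree on at least one shared SNP."""
    return any(a[s] != b[s] for s in set(a) & set(b))

# Build the conflict graph and 2-colour it by breadth-first search.
names = list(reads)
adj = {n: [m for m in names if m != n and conflict(reads[n], reads[m])] for n in names}

side = {}
for start in names:
    if start in side:
        continue
    side[start] = 0
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in side:
                side[v] = 1 - side[u]
                queue.append(v)
            elif side[v] == side[u]:
                raise ValueError("not bipartite: the reads contain errors")

print(side)   # r1 and r3 end up on one haplotype, r2 and r4 on the other
```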

Real data contain upstream errors, introduced during SNP calling or read mapping, so this becomes an optimisation problem: obtain such a bipartitioning while correcting the minimum number of errors in the bases of the reads. Several different optimisation criteria exist in the literature, some of them equivalent; we focus here on minimum error correction (MEC) [18], as it is the most general of these criteria. Current read data consist of many short reads that may pile up on certain SNP positions. Up to 2014, the state of the art in haplotype assembly [16, 17] solved MEC with approaches that scale, in terms of computational complexity, with the read length. In addition to this drawback, these algorithms take advantage of the fact that many neighboring SNPs are not linked by such reads, because this allows the optimisation problem to be decomposed into independent subproblems. When reads get longer, these subproblems will no longer be independent, and indeed they should not be, since the goal is to link all of the SNPs. Moreover, proportionally lower coverage, i.e., the number of reads covering a SNP position, will eventually suffice to obtain the relevant information.
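To make the objective concrete, here is a deliberately naive brute-force sketch in Python on an invented toy instance; it enumerates every bipartition of the reads and is therefore exponential in their number, so it is purely illustrative and bears no relation to how WHATSHAP or PWHATSHAP actually solve the problem.

```python
from itertools import product

# Toy read matrix (invented data): read -> {snp_index: observed allele}.
# The underlying haplotypes would be 000 and 111, but the last read carries
# one sequencing error, so at least one base must be corrected.
reads = [
    {0: 0, 1: 0},
    {0: 1, 1: 1},
    {1: 0, 2: 0},
    {1: 1, 2: 1},
    {1: 0, 2: 1},   # inconsistent read
]

def mec(reads, n_snps):
    """Brute-force minimum error correction: try every bipartition of the reads."""
    best = None
    for assignment in product((0, 1), repeat=len(reads)):
        cost = 0
        for snp in range(n_snps):
            for side in (0, 1):
                alleles = [r[snp] for r, a in zip(reads, assignment)
                           if a == side and snp in r]
                if alleles:
                    # The cheapest haplotype allele at this SNP is the majority
                    # vote, so the minority alleles are the corrections.
                    cost += min(alleles.count(0), alleles.count(1))
        best = cost if best is None else min(best, cost)
    return best

print(mec(reads, n_snps=3))   # prints 1 for this toy instance
```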

It is for these reasons that the authors of [19] introduced WHATSHAP, the first fixed-parameter tractable (FPT) [20] approach to the weighted minimum error correction (wMEC) problem [21] (and hence the MEC problem), with coverage as the only parameter. The runtime of this approach is linear in the number of SNPs per read, which is the quantity that will increase by orders of magnitude as longer and longer reads become available.
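The following simplified Python sketch conveys the flavour of such a coverage-bounded dynamic programme for the unweighted case on invented toy data; it is only loosely modelled on the description in [19] and omits the weights, the backtracing needed to recover the haplotypes, and all of the engineering of the actual implementation.

```python
from itertools import product

# Toy input (invented): read -> {snp_index: observed allele}; SNPs indexed 0..n-1.
reads = {
    "r1": {0: 0, 1: 0},
    "r2": {0: 1, 1: 1},
    "r3": {1: 0, 2: 0},
    "r4": {1: 1, 2: 1},
}
n_snps = 3

def active_reads(snp):
    """Reads whose covered interval includes this column; their number is the coverage."""
    return sorted(r for r, obs in reads.items() if min(obs) <= snp <= max(obs))

def column_cost(snp, partition):
    """Corrections needed in this column under a given read-to-haplotype assignment."""
    cost = 0
    for side in (0, 1):
        alleles = [reads[r][snp] for r, s in partition.items()
                   if s == side and snp in reads[r]]
        if alleles:
            cost += min(alleles.count(0), alleles.count(1))
    return cost

# Dynamic programme over SNP columns: a state is a bipartition of the reads
# active in that column, so there are at most 2^coverage states per column.
prev, prev_active = {(): 0}, []
for snp in range(n_snps):
    active = active_reads(snp)
    cur = {}
    for sides in product((0, 1), repeat=len(active)):
        partition = dict(zip(active, sides))
        local = column_cost(snp, partition)
        # A predecessor state is compatible if it puts every read that is
        # active in both columns on the same haplotype.
        best_prev = float("inf")
        for p_sides, p_cost in prev.items():
            p_part = dict(zip(prev_active, p_sides))
            if all(p_part.get(r, partition[r]) == partition[r] for r in active):
                best_prev = min(best_prev, p_cost)
        cur[sides] = best_prev + local
    prev, prev_active = cur, active

print(min(prev.values()))   # optimal unweighted MEC score (0 for this toy input)
```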

A distinguishing feature of WHATSHAP with respect to the other currently available proposals is that it is exponential in the sequencing coverage and not in the read length. This is particularly relevant in light of current trends in future generation sequencing technologies: technical improvements will clearly yield longer and longer reads. The WHATSHAP algorithm has been implemented in a freely available toolkit (https://bitbucket.org/whatshap/whatshap).

Because WHATSHAP is the first approach in this promising direction, it appeared worthwhile to speed up its implementation by parallelising it. This paper presents PWHATSHAP, an optimised parallelisation of WHATSHAP, together with its implementation in a toolkit that is also freely distributed (available at https://bitbucket.org/whatshap/whatshap). PWHATSHAP has been developed alongside the very active development of WHATSHAP itself. Preliminary results on parallelising the core of the algorithm were reported in [22]. In this paper we report on the parallelisation of the latest version of WHATSHAP, which has matured into an integrated framework engineered according to current trends in genetic applications and capable of analysing data in the standard file formats (such as BAM and VCF) used in genomic analysis.

The merits of this work are:

  • The PWHATSHAP project provides the research community with a freely available application that can easily be embedded in analysis pipelines requiring the solution of haplotyping problems. The core of the parallel haplotyping algorithm consists of an advanced, optimised implementation tailored to multi-core architectures. This enhanced core has now been engineered into the integrated framework described above, supporting standard data formats. This is a major engineering step, requiring several C++ core functions, coherently running as a parallel application, to be embedded in a framework developed in Python. It has allowed the PWHATSHAP project to move from a prototype development phase to a mature, open-source product. Haplotyping is typically employed in larger pipelines that also include other expensive steps, such as data acquisition and result analysis. The provision of efficient solutions to haplotyping, such as PWHATSHAP, enables more accurate analyses in all such contexts.

  • The incremental construction of haplotypes in WHATSHAP belongs to a class of algorithms that are notoriously difficult to parallelise. These algorithms process large amounts of data and are therefore sensitive to the availability of sufficiently large amounts of memory (RAM). Their exponential complexity (in time, but with direct implications for space), combined with the huge datasets currently available, easily makes memory availability a critical parameter. Parallelising an algorithm of this type therefore represents an engineering challenge. The solution adopted is supported by the FastFlow framework [23], which provides high-level parallel programming constructs, such as skeletons and parallel design patterns. Thanks to the high-level programming paradigm adopted, it has been possible to build PWHATSHAP while retaining most of the overall structure and code of WHATSHAP. The chosen paradigm also has the advantage of limiting the need for mutual exclusion mechanisms, which are typically slow (a conceptual sketch of this decomposition is given after this list). The clear performance improvement obtained supports the efficient treatment of large datasets and high coverage. It is important to note that the presented results can be obtained on computers that current state-of-the-art genomic laboratories can easily be equipped with. Such an improvement in the computational efficiency of haplotyping, made available at affordable cost, may be key in several analysis pipelines.

  • A comprehensive evaluation of the obtained results has been carried out, both theoretically and experimentally. First, the correctness of PWHATSHAP has been validated against WHATSHAP: both applications return identical results in terms of the wMEC score of the computed optimal solutions. Beyond correctness, the accuracy of PWHATSHAP is discussed in terms of the accuracy of WHATSHAP, which is known to be high. We discuss various aspects of the accuracy of WHATSHAP and review the constraints under which competing approaches to haplotyping work; PWHATSHAP emerges as an accurate and widely applicable approach. The efficiency of PWHATSHAP is discussed against theoretical complexity results and validated by means of experimental results on benchmark datasets. Overall, the wide applicability and accuracy of PWHATSHAP, together with its increased efficiency, make it a reference tool in the rapidly developing quest for solutions to the haplotyping problem.
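As a purely conceptual illustration of the decomposition mentioned in the second point above, the following Python sketch splits the enumeration of the 2^coverage bipartitions of a single SNP column into disjoint slices handled by independent workers, so that no mutual exclusion is required and the partial results are merged in a final reduction step. It uses standard-library concurrency and hypothetical helper names; the actual PWHATSHAP core is C++ parallelised with FastFlow, not this code.

```python
from concurrent.futures import ProcessPoolExecutor

def column_slice_cost(args):
    """Cost of every bipartition in one slice of a column's state space."""
    alleles, lo, hi = args                 # alleles: observed 0/1 per active read
    out = {}
    for code in range(lo, hi):             # each integer encodes one bipartition
        sides = [(code >> i) & 1 for i in range(len(alleles))]
        cost = 0
        for side in (0, 1):
            obs = [a for a, s in zip(alleles, sides) if s == side]
            if obs:
                cost += min(obs.count(0), obs.count(1))
        out[code] = cost
    return out

def parallel_column_costs(alleles, workers=4):
    """Map disjoint slices of the bipartition space to workers, then reduce."""
    n_states = 2 ** len(alleles)
    step = max(1, n_states // workers)
    slices = [(alleles, lo, min(lo + step, n_states))
              for lo in range(0, n_states, step)]
    costs = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(column_slice_cost, slices):
            costs.update(partial)          # sequential reduction step
    return costs

if __name__ == "__main__":
    # A column covered by six reads with these observed alleles (toy data).
    print(parallel_column_costs([0, 1, 0, 0, 1, 1]))
```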

In the next section, Methods, the haplotyping problem is defined and the WHATSHAP approach described. The details of the construction of PWHATSHAP are then illustrated, and the choices made in engineering the parallel solution are discussed. An account of FastFlow, the supporting high-level parallel programming framework, concludes the section. The Results section evaluates the performance of PWHATSHAP. Two main criteria are considered to illustrate the validity of the developed application: the accuracy of the returned results and the efficiency of the computation. The accuracy of PWHATSHAP builds on that of WHATSHAP, as discussed. Efficiency, instead, is demonstrated by means of experimental results on benchmark datasets. Concluding remarks follow.