NGS-Logistics: data infrastructure for efficient analysis of NGS sequence variants across multiple centers

Next-Generation Sequencing (NGS) is a key tool in genomics, in particular in research
and diagnostics of human Mendelian, oligogenic, and complex disorders [1]. Multiple projects now aim at mapping human genetic variation on a large scale,
such as the 1,000 Genomes Project and the UK 100k Genome Project. Meanwhile, with the
dramatic decrease of the price and turnaround time, large amounts of human sequencing
data have been generated over the past decade [2]. As of January 2014, about 2,555 sequencers were spread over 920 centers across the
world [3]. As a result, about 100,000 human exomes have been sequenced so far [4]. Crucially, the speed at which NGS data is produced greatly surpasses Moore’s law
[5] and challenges our ability to conveniently store, exchange, and analyze this data.
Data pre-processing is needed to extract reliable information from sequencing data
and it can be divided into two major steps: primary analysis (image analysis and base
calling) and secondary analysis. When looking for variation in the human genome, secondary
analysis consists of aligning/mapping the reads against the reference genome and scanning
the alignment for variation. Both raw data and mapped reads are large files occupying
significant disk storage space. The collection of files resulting from the analysis
of a single whole genome study can take up to 50Gb of disk space. This raises significant
issues in terms of computing and data storage and transfer, with off-site data transfer
currently being a key bottleneck. Moreover, the analysis of NGS data also raises the
major challenge of how to reconcile federated analysis of personal genomic data with
the confidentiality required to protect privacy. In many situations, the analysis of data
from a single study alone will be much less powerful than if it can be correlated
with other studies. In particular, when investigating a mutation of interest, it is
extremely useful to obtain data about other patients or controls sharing similar mutations.
However, personal genome data (whole genome, exome, transcriptome data, etc.) is sensitive
personal data. Confidentiality of this data must be guaranteed at all times and only
duly authorized researchers should access such personal data.