Integrating 400 million variants from 80,000 human samples with extensive annotations: towards a knowledge base to analyze disease cohorts

As high-throughput sequencing technologies become more widely employed, variants detected in large resequencing studies, including the 1000 Genomes Project, ESP6500, ExAC, and TCGA [1–4], are continuously being published. These variants differ from the ones targeted by genotyping arrays in that most of them are initially not properly annotated with genes, amino acid changes, impacts, associated diseases, or population frequencies. Individual and multi-sample data sets each require exhaustive annotation, using tools such as snpEff, ANNOVAR, or VEP [5–7], predictions of deleteriousness provided by SIFT, PolyPhen2, PROVEAN, and others [8–10], and curated databases such as dbSNP, ClinVar, and HGMD [11–13], to provide as detailed a picture as possible in support of interpretation on a sample-by-sample basis. Notably, current setups annotate every set of newly called variants from scratch: even though many variants were observed in earlier studies, the aforementioned algorithms and database lookups are run again on every new call set. The computation of functional predictions and population frequencies is especially costly and need not be repeated for recurring variants.

By integrating the results of multiple sequencing efforts, covering a large number of healthy subjects, with such information, we can construct a repository that serves two major purposes: annotating large numbers of genetic variants with the aforementioned tools and databases, and pooling variants and their frequency distributions in various populations. While the first primarily aims to reduce the number of operations needed to fully annotate new studies, the second provides a fundamental basis for analyses of disease populations, surpassing the capability of any individual study to function as a reference population.

In this paper, our major goal is to build an infrastructure that allows centralized storage of every variant observed in resequencing studies, in-house projects, or known in curated databases. In this centralized storage, variants will be annotated once using a spectrum of tools for functional impact and predictions, as well as population frequencies, disease associations, pharmacogenetic information, literature mining, and so on. With each additional sequencing study, the number of truly novel variants will decrease, as shown, for example, for whole genomes [14], which drastically reduces the number of variants that have to run through any annotation pipeline. A data warehouse that incorporates sequencing results from thousands of individuals from various ethnic backgrounds and disease populations allows for fast cross-study analysis, such as differential mutation analyses, to discover novel genetic risk factors, gene–disease associations, potential disease mechanisms, and actionable variants [15–17]. The accumulated allele frequencies also help to gain an understanding of the distribution of disease-associated variants in reference populations.
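The annotate-once idea can be illustrated with a minimal Python sketch. It is not the RVS loading workflow: the in-memory dictionary standing in for the central store, the function name annotate_call_set, and the annotate callback are all hypothetical, and variants are assumed to be identified by some stable string identifier.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Annotation = Dict[str, str]


def annotate_call_set(
    keys: Iterable[str],
    store: Dict[str, Annotation],
    annotate: Callable[[str], Annotation],
) -> Tuple[List[str], List[str]]:
    """Annotate only variants missing from the store.

    `store` is a hypothetical stand-in for the central repository and
    `annotate` for an expensive annotation pipeline. Returns the lists of
    previously known and newly annotated (novel) variant identifiers.
    """
    known: List[str] = []
    novel: List[str] = []
    for key in dict.fromkeys(keys):  # drop duplicates, keep input order
        if key in store:
            known.append(key)  # annotation is reused, nothing is recomputed
        else:
            store[key] = annotate(key)  # expensive step, once per variant
            novel.append(key)
    return known, novel
```

With each call set loaded this way, the share of identifiers that reach the annotate callback shrinks as the store grows, which is the effect described above.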

Our second goal is a platform-independent solution with respect to data storage and computation infrastructure. Relational databases, NoSQL stores, compute clusters, and Hadoop each have particular benefits for storage, indexing, querying, integration, or computation: some platforms are better suited to run secondary analysis pipelines and to call variants, some to compute allele frequencies across studies, some to run graphical, interactive user interfaces, some to store and access summarized data, and some to store per-individual data. We argue that such an endeavor requires a mechanism to compute a globally unique key for each normalized variant independently on each platform¹. Such a key makes it easy to map between all genetic variant resources employed across the entire infrastructure.
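To make the requirement concrete, the following Python sketch derives a deterministic, reversible key from location and alleles alone. It is an illustration under simplifying assumptions and not the key scheme used in RVS: the names Variant, trim, variant_key, and parse_key are hypothetical, normalization is reduced to trimming bases shared by both alleles (no left-alignment against a reference sequence), and chromosome names are assumed not to contain a colon.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position of the first reference base
    ref: str
    alt: str


def trim(v: Variant) -> Variant:
    """Remove bases shared by both alleles: suffix first, then prefix."""
    chrom, pos, ref, alt = v
    ref, alt = ref.upper(), alt.upper()
    # Trim a shared suffix, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim a shared prefix and shift the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return Variant(chrom, pos, ref, alt)


def variant_key(v: Variant) -> str:
    """Build the same key on any platform from location and alleles only."""
    chrom, pos, ref, alt = trim(v)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]  # treat "chr1" and "1" as the same chromosome
    return f"{chrom}:{pos}:{ref}:{alt}"


def parse_key(key: str) -> Variant:
    """Reverse the key into its trimmed variant (assumes no ':' in chrom)."""
    chrom, pos, ref, alt = key.split(":")
    return Variant(chrom, int(pos), ref, alt)
```

Under these assumptions, Variant("chr1", 100, "CAT", "CGT") and Variant("1", 101, "A", "G") both map to the key 1:101:A:G, so two platforms that receive differently padded representations of the same substitution agree on the key without consulting a shared lookup table.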

In summary, the functionality we present with the Reference Variant Store includes

  • data from various large resequencing studies and annotation databases;

  • extensive annotations including population frequencies, clinical significance, and predictions of functional impact;

  • integrated analysis of disease versus healthy populations;

  • a reversible variant key that uniquely identifies SNVs, MNVs, and indels, and that can be computed solely based on location and alleles;

  • a RESTful web service to access bulk data programmatically; and

  • per-sample information stored on Apache Hadoop allowing for fast computation of allele frequencies across populations, linkage disequilibrium, and population stratification.
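As an illustration of the allele frequency computation mentioned in the last item, the Python sketch below pools per-study allele counts for a single variant. It does not represent the Hadoop implementation; the function name and the (allele count, allele number) input layout are assumptions made for this example. Naive pooling also presumes that the studies do not share samples, which does not hold for every combination of resources (ExAC, for instance, includes samples from 1000 Genomes, ESP6500, and TCGA).

```python
from typing import Iterable, Optional, Tuple


def pooled_allele_frequency(
    counts: Iterable[Tuple[int, int]]
) -> Optional[float]:
    """Pool one variant's allele frequency across studies.

    `counts` holds one (allele count, allele number) pair per study, where
    allele number is the number of called alleles at the site. Returns None
    if no study has any called allele. Assumes non-overlapping samples.
    """
    total_ac = 0
    total_an = 0
    for ac, an in counts:
        if an < 0 or not 0 <= ac <= an:
            raise ValueError(f"inconsistent counts: AC={ac}, AN={an}")
        total_ac += ac
        total_an += an
    if total_an == 0:
        return None
    return total_ac / total_an
```

Summing counts before dividing weights each study by the number of alleles it actually called, which differs from averaging the per-study frequencies when study sizes are unequal.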

We have so far populated RVS with variants from the diverse resources shown in Table 1. RVS currently contains 473 million distinct variants at 389 million sites; 399 million of these variants have been observed in at least one of the studies we integrated; the remainder are largely hypothetical SNVs from dbNSFP² [18].

Table 1

Number of variants imported from various external resources

The first block refers to sequencing/genotyping studies, the second to sample-independent annotation databases. “Unique to study” counts variants that were observed only in that particular study. “Variants passed” refers to variants that passed quality metrics as defined by the particular study, meaning that at least one sample has to pass; n/a: individual sample quality metrics not available. Totals exclude duplicates seen in different studies. Variants in annotation databases are included only if they can be mapped to precise coordinates and alleles. Since a large proportion of the variants discovered by literature mining are given at the protein level only, they were not compared to other studies.

ᵃ dbNSFP contains hypothetical variants, see text

ᵇ ExAC includes samples from 1000 Genomes, ESP6500, and TCGA

ᶜ Note that data from HGMD, PharmGKB, UK10K diseases, and TCGA germline are not visible to external users on the RVS website

ᵈ Counts for SwissVar refer to distinct amino acid changes

Further details on individual resources are provided in Additional file 4: Table S3

Observed variants originate from 82,600 samples: 5,600 whole genomes, 66,000 whole exomes, and 11,000 genotyped samples. In addition to these observed and hypothetical variants, we included variants that are annotated independently of samples, from resources such as ClinVar, OMIM, COSMIC, and the literature.

The remainder of this paper is organized as follows. After presenting work closely related to ours, we provide details on the data sources and genetic variants imported into the Reference Variant Store so far, and show summary statistics on variant types and impacts. We then discuss applications and future directions for our work. Next, we explain the architecture and the workflows in RVS that support storage, annotation, and loading of novel variants. Finally, we present details on the allele-specific variant key and literature mining.

Related work

Several efforts share some of our goals in bringing together variants and annotations from large-scale sequencing studies. Chennagiri et al. [19] presented an idea to store genetic variants in a database for fast access, reduced redundancy, and Sanger benchmarking. They loaded more than 9000 samples from VCF files, including population frequencies from an early release of 1000 Genomes data, clinical samples, and additional Sanger sequencing data. Annotations encountered in VCF files are stored as key–value pairs to support arbitrary tags. For RVS, we want to obtain (sub-)population frequencies, including those of disease populations, from as many studies as possible. Clinical samples cover a variety of indications and originate from in-house and many external studies, genotyping and sequencing alike. We also enrich our annotations by integrating renowned resources, such as ClinVar and OMIM.

CanvasDB³ is a local infrastructure supporting the analysis of resequencing projects, using MySQL for storage and providing an R interface for analysis [20]. One major difference from RVS is that CanvasDB stores the entirety of sample-specific genotypes, such as those of 1092 samples from the 1000 Genomes Project. Users of CanvasDB can therefore perform SEQ-GWAS cohort analyses, defining cohorts on the fly and factoring in disease populations, family structures, and the like. CanvasDB can be used as a fast and powerful filtering tool to analyze groups of samples. RVS, in contrast, aims at having data from several large cohort studies as well as various sources of annotation readily available for the interpretation of observed variants.

GEMINI is a software package designed for exploring variation in personal genomes and family-based genetic studies [21]. It utilizes resources such as KEGG and ENCODE for annotation of genes and ClinVar for variants. Once the local hosting solution is set up, users can import single samples or larger studies to store individual genotypes. Complex queries make it possible to find variants matching different inheritance patterns or to run burden calculations. With RVS, in contrast, our focus is on providing detailed annotation for large numbers of preloaded variants, with data from several large sequencing studies readily available to the user; however, RVS currently does not store data by individual sample.

The Exome Aggregation Consortium (ExAC) recently presented their effort to make genetic variation data observed in 63,358 whole exomes publicly available [3]. ExAC brings together data from healthy and disease populations and can be searched by gene, variant, or dbSNP identifier to show population frequencies and other annotations, such as affected transcripts or disease associations according to ClinVar. It also offers quality metrics to inform users about the reliability of calls, such as read depth histograms obtained from samples interrogated at each site. Contributing projects to date range from 1000 Genomes and ESP to TCGA, Swedish Schizophrenia and Bipolar Studies, and several type 2 diabetes studies.

EVA, the European Variation Archive, collects highly detailed, granular, raw variant data from humans (with other species to follow) [22, 23]. Types of genetic variation data include short as well as structural variations. EVA provides a web-based browser to query the entirety of variants for studies, genes, frequencies, and raw data, such as from VCFs. One of the benefits of EVA is that it allows users to submit variants obtained in their own studies by sample, supporting pedigree information as well. In addition to collecting variants, RVS focuses on extensive annotation in terms of population frequencies, clinical significance, predicted impacts, and so on.

SG-ADVISER [24] is a standalone application that retrieves annotations for variants, including copy number variants, from a web server on the fly. The back end of SG-ADVISER utilizes a combination of precomputed data and high-performance computation on demand. Similar to RVS, the results include coding and protein impact, splicing impact, allele frequencies, and clinical annotations; in addition, data on regulatory variants and genomic regions as well as ontological information on processes, functions, and pathways are available.