Copy number variations in the genome of the Qatari population

Detection of CNVs

To identify CNVs in the Qatari population, primary data was obtained from arrays and
whole-genome sequencing sources, and then called and integrated as described in methods
(Fig. 1). Briefly, CNV calls were first generated in the 100 individuals from Illumina’s
Omni2.5 M array intensity data using both cnvPartition (Illumina’s proprietary Beadstudio
plug-in) and QuantiSNP 27], and, separately, from NGS using cn.MOPS 28], in addition to CNV calls provided by Illumina from WGS data. Altogether, there were
two primary datasets called by 2 independent algorithms each, and all 4 subsequently
combined for final CNV calls as described below.

Fig. 1. CNV analysis strategy. CNV detection in Qataris was assessed at two tiers. First,
CNVs were called in 100 individuals using two algorithms each, on two primary input
datasets: genotyping array (OMNI2.5 M) and next-generation whole genome sequencing
reads (Illumina PE 100 bp, Mean Depth: 37X). A size cut-off of at least 5 consecutive
probes for genotyping data and at least 5 consecutive windows for whole genome sequencing
data was used to increase specificity (see Methods). Three samples with an unusually
high number of CNVs were first removed from the population (see Additional file 1: Figure S1). In the second step, high-quality CNVs from the remaining 97 subjects
called by all 4 platforms were distributed into 97 individual files. CNVs were first
compared intra-individuals and retained if observed by more than one algorithm. If
no overlap was detected within the individual, the CNV was compared inter-individuals
to detect a second occurrence in the remaining 97 individuals. CNVs observed only
once in the entire sample were discarded. CNVs passing these filters were merged across
the population to generate population level CNV regions (CNVRs), which were taken
into the detailed analysis steps. *Denotes data was provided as-is from proprietary
Illumina Genome Network sequencing pipeline without the ability of the user to alter
parameters

Preliminary qualitative inspection of the raw distribution of all CNVs in the 100
individuals revealed 3 outlier samples with a large excess of CNV calls (Additional
file 1: Figure S1). These individuals significantly skewed the average number of CNVs in
the population (Additional file 1: Table S1) and were therefore removed from further consideration, yielding a cohort
of 97 individuals in whom all subsequent analysis was conducted.

The 4 platforms initially identified a total of 536,889 CNVs from all 4 algorithms
in the 97 individuals, including 119,236 putative deletions [copy number class (CN)
0 or 1] and 417,653 putative duplications (CN 3 or 4+; Table 1). The excess of duplications over deletions is largely a result of CNVs provided
by Illumina’s proprietary WGS calls, which reported 314,656 duplications and 49,177
deletions, with no homozygous deletions (CN 0) called in the 97 samples.

Table 1. Copy Number Variations in the Qatari Population
a

In order to enhance specificity, we devised an approach to integrate CNV calls across
all 4 platforms, requiring a CNV to be observed at least twice to be retained (Fig. 1). In this step, each of the 536,889 raw, high quality CNVs was first compared to
all other CNVs detected by any of the four platforms within the same individual file,
and those observed twice (detected by 1 algorithm in the same individual) were included
in the ‘final’ variant file for that individual. All CNVs observed only once were
then compared across all other individuals to look for a second occurrence in another
individual. If found, that CNV was retained in the individual’s ‘final’ variant file,
or otherwise discarded. This allowed for significant refinement of the list of CNVs
in the population, eliminating all singleton CNV occurrences in the population – usually
the most enriched for spurious calls. Using this approach, the average individual’s
‘final’ genome had 1824 high-confidence CNVs, comprising 120 homozygous deletions,
628 single-copy deletions, 801 single-copy duplications and 275 amplifications (Table 1).

During this filtration and integration process, all CNVs were concurrently curated
to re-define breakpoints based on the source of CNVs. Briefly, whenever NGS-derived
CNVs overlapped array-CNVs, we used the NGS’s higher-resolution breakpoints to define
the start and/or end coordinate of the duplication or deletion. Wherever two CNVs
detected from the same platform were observed to overlap, the narrower breakpoints
were chosen, yielding a shorter, more conservative CNV call. After this curation,
the ‘final’ CNV content in the average Qatari genome affected a total of 29.9?×?10
6
non-redundant bases (Table 1). This is slightly lower than previously published estimates 22] and may reflect the strict filtration and breakpoint definition thresholds applied
in this study.

The median Qatari genome, based on the four different algorithms and platforms, contained
1815 high-resolution CNVs, covering an estimated total of 27,991,857 bp (Table 1). These were distributed into the 4 CN classes: CN–0 (homozygous deletions), 121
CNVs, affecting 1.1 Mb; CN–1 (single-allele deletions), 622 CNVs, affecting 5.8 Mb;
CN–3 (single-allele duplication), 801, affecting 16.8 Mb; and CN–4+ (amplification),
271, affecting 4.1 Mb. The excess number and larger size of duplications could be
explained by a higher proportion of duplications detected by our NGS-CNV callers (Table 1), and may reflect a combination of reads mapping to segmental duplications and the
fact that we included all multi-allelic CNVs with more than 4 copies in the same amplification
group 11]. Of 1815 CNVs, 1381 (76.1 %) were detected by NGS alone, 122 (6.7 %) were detected
by array technology only, and 312 (17.2 %) were detected by both, suggesting that
the two datasets may both be beneficial in representing the total CNV content within
an individual, and relying on only one may not be sufficient to cover all variation
(Additional file 1: Figure S2).

CNV distribution in Qatari subpopulations

The 97 individuals were examined in the context of the three Qatari ancestral subpopulations
(57 Q1 – Bedouin ancestry, 20 Q2 – Persian ancestry, and 20 Q3 – African ancestry).
In order to evaluate the accuracy of CNV calls, we initially used the CNVs detected
across all 97 individuals and performed principle component analysis. This analysis
separated individuals previously known to belong to Q1, Q2 and Q3 from genotyping
data into their three respective subpopulations based on CNV sharing. The PCA plot
showed some level of overlap between Q1 and Q2 clusters, which could be a result of
admixture and our assignment of ethnicity based on only 65 % of 48 SNPs (Additional
file 1: Figure S3 and details in methods), with Q3 (with the exception of 1 individual)
being the most clearly distinct subpopulation. These results are similar to those
obtained from a PCA plot using only SNPs, as published in 29]. The similarity of clustering using PCA on CNV and genotyping data in 97 Qataris
is consistent with a previous report demonstrating that PCA analysis based on high
quality CNVs yields similar clusters to one based on SNPs from the same individuals
30].

We then inspected the distribution of CNVs by frequency in each class per individual
(Additional file 1: Figure S4), and observed that, on average, individuals from all three subpopulations
had a similar range of CNVs in all four classes. However, in order to detect if the
three genetic subpopulations may have differences in the distribution by number or
size of CNVs in each CN class, probability curves were generated of CN number (Fig. 2a-d) and total size affected (Fig. 2e-h) within each CN class for each of the 3 subpopulations (as described in Methods).
For CN class 0 (homozygous deletions), these occurred at a significantly higher frequency
in Q1 and Q2 over Q3 (p?=?1.8×10
?6
and 1.2?×?10
?4
, respectively). However, this trend was reversed in amplifications (CN 4+), which
were found at a higher rate in Q3 than either Q1 or Q2 (p?=?1.5?×?10
?5
and 0.006, respectively). These observations may reflect higher consanguinity rates
in recent generations within Q1 and Q2, where enrichment in homozygous deletions (Fig. 2a) but depletion of amplifications vs Q3 (Fig. 2d) suggests that homozygous deletions are more harmful than multi-allelic, runaway
duplications, and may therefore have been purged from Q3 by purifying selection over
population history but only recently arisen in Q1 and Q2. This possibility is supported
by two further observations. First, for single-copy deletions (CN 1), we observed
a significantly higher number in Q3 (p?=?3?×?10
?7
and 1?×?10
?7vs Q1 and Q2, respectively) despite the depletion of homozygous deletions relative to
the other two subpopulations, suggesting higher diversity and less consanguinity in
recent generations among Q3 Qataris vs Q1 or Q2. Second, for Q1, we observe a slightly longer tail in the size of the genome
affected by single copy deletions (Fig. 2f) despite reduced number of CNVs in that class compared to Q3, suggesting these alleles
are larger in size and possibly more recent or more deleterious, causing this tail
of large CNVs to be absent in the homozygous subset of CNVs in Q1 (Fig. 2e).

Fig. 2. Probability distributions of CNVs by frequency and size in each copy number class
in 97 Qataris. Density curves showing the probability (y-axis) of a given individual
from each of the 3 subpopulations having a certain number of CNVs (a–d) or a certain cumulative size of the genome affected by CNVs (e–h) in each copy number class (a, e. CN?=?0; b, f. CN?=?1; c, g. CN?=?3; d, h. CN?=?4+). All p -values are calculated using the ANOVA-Tukey method. Black trace
– Q1, Blue trace – Q2, Red trace – Q3

Genomic impact of CNVRs in the genetic subpopulations

In order to evaluate the impact of duplications and deletions on each subpopulation
individually, we first separately merged deletions and duplications within each group
to detect subpopulation-specific CNV Regions (CNVRs). There were a total of 16,660
CNVRs in the 3 subpopulations; 12,709 (76.2 %) came from NGS data only, 1976 (11.9 %)
from array only, and 1975 (11.9 %) from both platforms combined (Additional file 1: Figure S2B; see Additional file 1: Additional Data). When deletions and duplications at the same locus (polymorphic
CNVRs) were combined, there were a total of 14,058 CNVRs, including 7092 deletions,
4885 duplications, and 2081 polymorphic CNVRs (Table 1).

In the Q1 subpopulation, there were a total of 5241 CNVRs of all CN classes, affecting
85.7 Mb of genomic content; in Q2, 4176 CNVRs affecting 65.8 Mb, and in Q3 4641 CNVRs
affecting 65.8 Mb (Table 1). The excess number and cumulative size of CNVRs in Q1 is likely due to the ~3-fold
higher number of individuals studied. As expected, the majority of CNVRs were sub-population
specific, with 3624, 3242 and 3633 CNVRs at low-frequency (affecting 1 to 20 % of
individuals) in Q1, Q2 and Q3 respectively, vs only 2657, 1715 and 1789 that were common (affecting 20 %).

Functional effect of CNV-affected genes in Q1, Q2 and Q3

In order to evaluate the functional effect of deletions and duplications separately
on the entire population, the polymorphic CNVRs were separated into their respective
CN classes (Table 2). In total, 16,660 CNVRs were observed in all four CN classes in the three subpopulations,
including 6281 in Q1, 4957 in Q2 and 5422 in Q3. In all three subpopulations, ~39-40 %
of all CNVRs were genic (2491 in Q1, 1995 in Q2 and 2085 in Q3), 4-5 % affected microRNA
loci (229 in Q1, 183 in Q2 and 180 in Q3), 13-15 % affected promoter sites (831 in
Q1, 647 in Q2 and 660 in Q3) and ~38-40 % affected transcription factor binding sites
(2573 in Q1, 1879 in Q2 and 2065 in Q3). We focused on genic CNVs in subsequent analysis
to determine the extent of CNV impact on genes and pathways and population burden
for genetic disease.

Table 2. Functional Annotation of CNV Regions in the Qatari Genetic Subpopulations
a

Genic pathway enrichment

The genes affected by CNVRs in all Qataris were evaluated by standard pathway analysis
against the KEGG pathway database using the DAVID bioinformatics suite 31]–33]. Among the top 15 pathways enriched for by genes affected by all CNVs in Qataris,
we observed several of potential concern for public health (Table 3). These included genes involved in starch and sugar metabolism, in the insulin signaling
pathway, and in type I and type II diabetes mellitus (Additional file 1: Figure S5A-E). Among these genes was the amylase enzyme AMY1, for which decreased
copy number was previously shown to be associated with obesity 34]. Of interest, 47 of 97 individuals in the cohort had type 2 diabetes (27 Q1, 10 Q2
and 10 Q3), but there was no statistical enrichment for any of these CNVs in obese
or diabetic individuals vs controls. This may be due to the low power in small sample size, combined with the
possibility that individuals labeled as controls have yet to develop diabetes due
to their young age at time of assessment (cohort average age 42 years, with 50 %
40 year). We also observed nominal enrichment in other medically relevant pathways,
including drug metabolism and non-small cell lung cancer (Table 3). Together, these observations suggest that CNVs in this population may affect public
health by contributing to the burden of chronic disease in the population and should
be assessed systematically in a larger cohort to establish power and assess significance.

Table 3. Top 15 KEGG Pathways Enriched in Genes Affected by CNVs in Qataris
a

CNVs affecting Mendelian disease genes

In order to determine whether CNVs may also play a role in rare disease in Qataris,
we compared all genes affected by CNVRs to the database of Online Mendelian Inheritance
in Man (OMIM). In all three subpopulations, approximately 10 % of all genic CNVRs
affected at least 1 OMIM gene (Table 2). The OMIM database contains a combination of disease causing genes, as well as disease-associated
genes and genes affecting polymorphic traits. Because we were most interested in genes
that have sufficient evidence of disease-causality from the literature, we re-annotated
all CNV-encompassed OMIM genes based on their published role in causing disease, and
then manually curated all putative OMIM-gene-containing CNVRs to determine the exact
number of exons that were likely to be disrupted by each CNV (contained within the
CNVs’ breakpoints).

The focus was on the subset of CNVRs most likely to have a functional impact on a
gene. These include deletions affecting any number of exons and duplications that
either encompass at least one entire gene (increased dosage) or are internal to the
gene (possibly disrupting protein translation frame). We therefore eliminated from
consideration all intronic events as well as duplications that were partially genic
(one breakpoint extending past the first or last exon with the other breakpoint inside
the gene). We then split the list of OMIM-affected genes into two groups: (1) genes
in which CNVRs had been previously reported; and (2) genes affected by novel, Qatari-specific
CNVRs. In the former group, we found a total of 46 disrupted disease-causing genes
(13 in deletions and 33 in duplications) affected by 40 unique CNVR loci (13 deletions
and 27 duplications) (Table 4). These CNVRs had variable distribution among the 3 subpopulations, with most being
specific to one or two populations while only 9 were observed in all 3 subpopulations.
Thus, the majority of Mendelian-disease-gene containing CNVRs is population-specific,
and may predispose to disease due to high levels of intra-population mating. Further,
though these CNVRs are marked as previously reported due to overlap with CNVRs in
the database of genomic variants (DGV) 35], it is possible due to the variable breakpoints of CNVs deposited in the DGV that
these Qatari CNVs affect different exons or occur at a higher frequency in this population
than the rest of the world.

Table 4. Qatari Genetic Subpopulation-specific Distribution of Known CNV Regions Deletions
Affecting Known Mendelian Disease Genes
a

We also examined OMIM-gene-containing CNVRs that were novel to Qataris. To determine
novelty here, Qatari CNVRs were compared to CNVRs reported in the 1000 Genomes Phase
I 36] study that were detected through next-generation sequencing with high-resolution
breakpoints. Only 14 Qatari CNVRs passed this filter, reflecting the high diversity
of populations represented in the 1000 Genomes data. These CNVRs included 9 deletions
and 5 duplications (Table 5). Five of these CNVRs were Qatari sub-population-specific, while nine were shared
by 2 or more sub-populations. Of the shared CNVRs, there were four deletions – one
of exon 47 in the Chediak-Higashi syndrome gene LYST (lysosomal trafficking regulator
gene) observed in one Q1 and one Q2 individuals, one in the glutaric acidemia gene
ETFDH (electron transfer flavoprotein dehydrogenase) in one Q1 and one Q2 individuals,
one in exons 2 to 3 of the alpha-methyl acetoacetic aciduria gene ACAT1 (acetyl-coA
acetyltransferase 1) in one Q1 and one Q3 individuals, and one in exons 1-7 of the
Gitelman Syndrome gene solute carrier 12, family member 3 (SLC12A3) observed in one
Q1 and two Q2 individuals. All of these disorders are autosomal recessive and these
deletions putatively truncate the genes and therefore predispose these subpopulations
to these diseases if present in homozygous state. Additionally, there was one disease-gene
affecting CNVR that was present in 7 individuals from all three subpopulations (5
Q1, 1 Q2 and 1 Q3), a 3 kb internal duplication of exons 13-14 of PMS2 (post-meiotic
segregation increased in S. cerevisiae 2), a gene in which mutations in both alleles are observed in patients with hereditary
nonpolyposis cancer and mismatch repair cancer syndrome. Additionally, 3 other individuals
(2 Q1, 1 Q2) had a smaller (2.7 kb) deletion affecting the same exons. In total, 10
individuals (10.3 % of the cohort) had a CNV not present in public databases that
putatively disrupts PMS2. Of note, colorectal cancer is the second most common cancer
in Qatari males and third most common in females 37]; whether this gene contributes to the burden of colorectal cancer in this population
is currently not known.

Table 5. Novel Qatari-specific CNVRs Affecting OMIM Disease Genes
a

Qatari CNVs affecting known disease cytobands

There has been substantial evidence implicating CNV mutations in a range of diseases,
including obesity, congenital heart disease and a variety of neuropsychiatric disorders
13]–24], 31]. In particular, there is a growing body of literature suggesting rare but recurrent
CNVs at several loci are responsible for a proportion of these diseases in sporadic
cohorts 14], 18], 20], 21], 23], 38]. We sought to determine the burden of CNVs by chromosomal cytoband to detect any
enrichment in Qatari Arabs over global cohorts. Because the database of genomic variants
(DGV) contains 200,000 CNVRs from 200 studies 35] detected using a wide variety of low- and high-resolution platforms, we limited this
comparison to CNVRs detected by an equally high-resolution platform (next-generation
sequencing) in the diverse 1000 Genomes Project phase I study (1000Gp1) 36]. All CNVRs reported in the 1000Gp1 dataset and from our study in each of the 3 Qatari
subpopulations were annotated by cytoband. Of 862 cytobands in the 24 human chromosomes,
769 contained CNVRs in the 1000 genomes samples; of these, 741 had CNVRs in Q1, 708
in Q2, and 735 in Q3. There were several cytobands observed in which unique CNVRs
were observed at a much higher frequency (1.5 to 10 times more non-overlapping CNVRs
per cytoband) in any one of the Qatari subpopulations than in the phase I data. Among
the top 10 cytobands (Table 6) with the highest enrichment were several disease-associated hotspot loci, including:
1q21.1 (Q1 p?=?1.6?×?10
?16
, Q2 p?=?9.67?×?10
?22
and Q3 p?=?2.4?×?10
?20
), in which recurrent CNVs have been observed in patients with schizophrenia and congenital
heart disease; 5q13.2 (p?=?3.4?×?10
?11
, 8.2?×?10
?14
and 1.8?×?10
?11
in Q1, Q2 and Q3 respectively), a locus associated with neurological disorders and
alcohol dependence, 16p11.2 (p?=?7.1?×?10
?9
, 8.1?×?10
?12
; and 1.2?×?10
?11
in Q1, Q2 and Q3 respectively), a locus highly associated with autism, schizophrenia
and childhood obesity; and 1p36.33 (p?=?2.87?×?10
?7
in Q3 only), a locus associated with disorders of sexual development and obesity 39]–43]. CNVRs in these loci may contribute to the collective burden of these disorders in
the Qatari population.

Table 6. Top 10 Cytobands in Which Qatari Genetic Subpopulations’ CNVRs were Observed at a
Significantly Higher Frequency than in 1000 Genomes Phase I CNV Data
a

Determining tagging probes for Qatari CNVRs

The Illumina OMNI2.5 M array was developed as a global population-based genotyping
tool to integrate variation down to minor allele frequency (MAF) 2.5 % from the diverse
populations represented in the 1000 Genomes Project 36]. We first sought to estimate the utility of the Illumina OMNI 2.5 M array for studying
the Qatari population. To assess the amount of “informative” SNPs on the OMNI2.5 M
array in this population, we used genotypes from a total of 108 Qataris genotyped
on this platform. The OMNI 2.5 M SNPs were first pruned for those that had a call
in at least 90 % of the 108 individuals (i.e., at most 11 individuals with “no calls”),
which resulted in 2,368,880 (99.5 % of all SNPs). Surprisingly, of these, 412,839
SNPs (17.4 %) were monomorphic (MAF?=?0 %) in 108 Qataris. Additionally, 676,116 (28.5 %)
had a MAF of less than 1 % in Qataris, and 1,028,842 (43.4 %) have a minor allele
frequency of less than 5 %. Therefore, in total less than 60 % of the SNPs on the
OMNI 2.5 M array adequately sample common variants in Qataris (Additional file 1: Table S2).

Nevertheless, we attempted to assess whether a subset of these SNPs tag common CNVs
in this population, which could be useful for imputation of Arab-population CNVs from
genetic data using this or similar arrays in future studies. In order to increase
specificity, we focused on 1193 common deletions (deleted allele observed at least
4 times) across the 97 individuals in this study, and investigated the pairwise correlation
between the deletion CNV and all SNPs within 500 kb either side of the deletion’s
breakpoints. As expected, while 1168 CNVs (98 %) had at least 1 SNP within 500 kb
of either breakpoint, the majority of SNPs from the OMNI 2.5 M array neighboring CNVs
did not adequately tag the deletion allele, with the majority of SNPs (~62 %) having
a maximum r
2
??0.5 (Fig. 3a). In fact, only 318 of 1193 deletions were tagged by at least one SNP at a Pearson
correlation of r
2
??0.7, of which only 195 (~16 %) were tagged at r
2
??0.9 (Additional file 1: Table S3). Further, of the 422 genic subset of CNVs within the 1193 deletions, only
35 % were tagged by an array SNPs with a correlation r
2
??0.5, and less than 7 % appeared in complete LD (r
2
??1.0). Therefore, the majority of deletions common to the Qatari population were
poorly tagged by the high density OMNI 2.5 M array, which could pose significant challenges
to using this or other commercial arrays to genotype CNVs in Arab populations.

Fig. 3. All SNPs within 500 kb of start and end breakpoints of 1,193 deletions were used to
detect for each deletion a SNP with the maximum pairwise LD correlation. This was
done both for a. all 1193 CNVs and b. for only 422 Genic CNVs. In both cases, the WGS SNVs significantly outperformed the
OMNI2.5 M SNPs, especially at higher r
2
values. WGS-SNVs: Whole genome sequencing detected variants (?). OMNI2.5 M-SNPs: SNPs
present on the OMNI2.5 M array (?)

In order to rectify this issue, we sought to determine a set of SNPs that could better
tag these CNVs by relying on genotypes obtained from the whole genome sequencing of
these 97 individuals (described in methods). All ~21 million high-quality variants
detected in 97 individuals were first pruned for those within 500 kb upstream and
downstream of each CNV breakpoint, and then LD measured between each CNV and all neighboring
SNPs within this window. There was a highly significant improvement of up to 250 %
for all CNVs and almost 1.5 times that (367 %) for genic CNVs at r
2
?=?1 (Fig. 3b and Additional file 1: Table S3). With whole genome sequencing SNPs, we observed 70 % of all deletions
being tagged by at least one SNP at an r
2
??0.5, and over 50 % at r
2
??0.8, suggesting these could be imputed in future experiments from sequence data.
In order to facilitate the design of new genotyping arrays that tag CNVs in this population,
we include a list of deletion-tagging genotypes at SNPs tagging 806 CNVs at r
2
??0.5 (Additional file 1: Table S4). We also include this information at greater detail in the accompanying
Additional file 2 containing the complete CNV dataset with all functional annotation in 97 Qataris.