Predicting the combined effect of multiple genetic variants

Scoring multiple SNPs

Processing the variants from the 1000 Genomes project resulted in scoring 67,109 SNP
sets. A SNP set may be formed from different transcripts, which results in multiple
scores for a set (there are 91,970 set scores in total). For a SNP set and transcript
pair, HMMvar measures the deleterious effect of the SNP set using the original transcript
as the wild type sequence. 291,662 single variants from those SNP sets were gathered
and scored. The mean set score distribution is significantly different from the single
variant score distribution (one-tailed Wilcoxon rank-sum test, $p < 2.2 \times 10^{-16}$).
One thousand SNP set scores and 1000 single SNP scores are repeatedly sampled from
91,970 set scores and 275,840 single SNP scores. The cumulative distribution functions
of the means of the set scores and single scores are shown in Fig. 2. SNP sets are
more likely to receive higher scores than single SNPs. The density of SNP set scores
tends to be higher than the density of individual SNP scores at both ends.

Fig. 2. Comparison between variant set score (black) and single variant score (red)
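
The resampling behind Fig. 2 can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the number of repetitions, sampling without replacement, and the toy score lists are all assumptions, and the function names are hypothetical.

```python
import random
import statistics


def resampled_means(scores, sample_size=1000, repeats=200, seed=0):
    """Draw `repeats` samples of `sample_size` scores and return each sample's mean.

    Sampling without replacement and repeats=200 are assumptions; the text
    only says that 1000 scores are "repeatedly sampled".
    """
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(scores, sample_size)) for _ in range(repeats)]


def ecdf(values):
    """Return the sorted values and their empirical cumulative probabilities."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


# Toy stand-ins for the real score lists (hypothetical data, not HMMvar output).
toy_rng = random.Random(42)
set_scores = [toy_rng.lognormvariate(0.3, 0.5) for _ in range(5000)]
single_scores = [toy_rng.lognormvariate(0.0, 0.5) for _ in range(5000)]

set_means = resampled_means(set_scores)
single_means = resampled_means(single_scores, seed=1)

# Points of the empirical CDF of the set-score means, ready for plotting.
xs, ps = ecdf(set_means)
```

Plotting `xs` against `ps` for both groups of means would give curves of the kind compared in Fig. 2.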

Let $V = \{v_1, v_2, \ldots, v_n\}$ be a set of variants $v_i$ ($1 \le i \le n$), let $S$
denote the HMMvar score of the set $V$, and let $s_1, s_2, \ldots, s_n$ be the
corresponding single variant scores of $v_1, v_2, \ldots, v_n$, respectively. Define $V$
as a compensatory mutation (CM) set if
$S \le \min\{s_1, s_2, \ldots, s_n\} - 1.5\,(\max\{s_1, s_2, \ldots, s_n\} - \min\{s_1, s_2, \ldots, s_n\})$.
One hundred eighteen CM sets were obtained from the data set. The CM sets indicate
that the deleterious effect of a single variant is compensated by combining it with
other variants.

Define $V$ as a noncompensatory mutation (nonCM) set if
$S \ge \max\{s_1, s_2, \ldots, s_n\} + 1.5\,(\max\{s_1, s_2, \ldots, s_n\} - \min\{s_1, s_2, \ldots, s_n\})$.
Two thousand three hundred ninety-two nonCM sets were obtained from the data set.
The nonCM sets indicate that the joint effect of multiple neutral variants could
result in a deleterious effect.
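
The two definitions amount to an outlier rule on the set score: a set is CM when its score falls at least 1.5 times the single-score range below the smallest single score, and nonCM when it lies at least 1.5 times the range above the largest. A minimal sketch, assuming non-strict inequalities (the function name, the labels, and the example values are hypothetical):

```python
def classify_set(set_score, single_scores, k=1.5):
    """Label a variant set as 'CM', 'nonCM', or 'neither'.

    set_score     -- score S of the whole set
    single_scores -- scores s_1, ..., s_n of the individual variants
    k             -- multiplier on the single-score range (1.5 in the text)
    """
    lo, hi = min(single_scores), max(single_scores)
    spread = hi - lo
    # Note: if all single scores are equal, spread is 0 and the thresholds
    # collapse to lo and hi; the text does not discuss this edge case.
    if set_score <= lo - k * spread:
        return "CM"
    if set_score >= hi + k * spread:
        return "nonCM"
    return "neither"


# Illustrative values only, not scores from the data set.
print(classify_set(1.2, [3.0, 4.0]))  # threshold is 3.0 - 1.5 = 1.5
print(classify_set(2.0, [1.0, 1.2]))  # threshold is about 1.2 + 0.3 = 1.5
print(classify_set(1.1, [1.0, 1.2]))  # between the two thresholds
```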

To investigate the single variants in the CM and nonCM sets, all the single variants
from all the CM sets and all the nonCM sets are gathered, respectively. The allele
frequency distributions from these two groups are compared in Fig. 3. When the allele frequency is less than 0.1, the proportion of the nonCM variants
is greater than that of the CM variants. This is probably because the single variants
are so deleterious that, in most cases, the joint effect of these deleterious variants
is still deleterious. However, when the allele frequency is in the range of 0.1 to
0.3, the signal of the compensatory mutation effect is boosted.

Fig. 3. Allele frequency distribution of SNP variants in CM sets and nonCM sets
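
A comparison of this kind can be made by binning the allele frequencies of each group and comparing the per-bin proportions. The sketch below assumes ten equal-width bins and uses made-up frequencies; neither the bin width nor the values come from the study, and the function name is hypothetical.

```python
def binned_proportions(freqs, n_bins=10):
    """Proportion of allele frequencies (each in [0, 1]) in each equal-width bin."""
    counts = [0] * n_bins
    for f in freqs:
        # A frequency of exactly 1.0 is placed in the last bin.
        counts[min(int(f * n_bins), n_bins - 1)] += 1
    return [c / len(freqs) for c in counts]


# Hypothetical allele frequencies, for illustration only.
cm_freqs = [0.05, 0.12, 0.18, 0.25, 0.40]
noncm_freqs = [0.01, 0.03, 0.04, 0.08, 0.22]

cm_props = binned_proportions(cm_freqs)
noncm_props = binned_proportions(noncm_freqs)

# Compare the two groups bin by bin (bin 0 covers frequencies below 0.1).
for i, (cm, non) in enumerate(zip(cm_props, noncm_props)):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): CM={cm:.2f} nonCM={non:.2f}")
```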

As a test case for HMMvar's capability in predicting the effect of multiple variants
compared to the effect of single variants, multiple mutations in the β-myosin heavy
chain (β-MHC) and myosin-binding protein C (MyBP-C) genes that have been shown to
increase the severity of cardiovascular disease relative to single mutations are scored.
Studies have shown that single mutations in these two genes can lead to genetic
cardiovascular disease, and that multiple mutations in these same genes can lead to more
severe cardiovascular disorders and even death [17]. As shown in Table 3, for both genes,
compound mutations all have higher HMMvar scores than single mutations,
consistent with the notion that compound mutations in these genes cause more severe
cardiovascular disease than single mutations. The set score effectively reflects the
cumulative effects of the single mutations. The maximum score for compound missense
mutations in the β-MHC gene is for the combination of Arg719Trp and Met349Thr, which
has been reported to cause sudden death [17].

Table 3. Scoring multiple mutations in β-MHC and MyBP-C genes

Scoring compensatory indels

In TP53, 850 of 3565 variants met the criterion for belonging to a compensatory
indel set. The deleterious functional effects caused by these
variants can be greatly weakened by compensatory indels as measured by HMMvar scores.
There may be different compensatory indel sets for a given single variant due to different
combinations. Figure 4a shows the HMMvar score of a single variant versus the median of the HMMvar scores
of the corresponding compensatory indel sets. It is obvious that most of the deleterious
variants (high HMMvar scores) are neutralized by the compensatory indel sets (low
HMMvar scores ≈ 1).

Fig. 4. Scatter plot of HMMvar score of a single variant versus the median HMMvar score of
the corresponding compensatory indel sets for the TP53 gene and the PTEN gene. The
red line is y=x. (a) TP53 compensatory indels. (b) PTEN compensatory indels. The red solid circle marks the COSMIC variant with ID 428080
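
Each point in such a scatter plot pairs a single-variant score with the median score of that variant's compensatory indel sets; a point below the line y=x means the sets score lower than the variant alone. A minimal sketch with made-up values (the variant names, scores, and function name are hypothetical, not TP53 or PTEN results):

```python
import statistics


def summarize_compensation(single_score, set_scores):
    """Return (single score, median set score, whether the point lies below y = x)."""
    median_set = statistics.median(set_scores)
    return single_score, median_set, median_set < single_score


# Hypothetical toy values: single score, then scores of its compensatory indel sets.
toy = {
    "variant_A": (2.40, [1.05, 0.98, 1.20]),
    "variant_B": (1.10, [1.15, 1.30]),
}

points = {
    name: summarize_compensation(single, sets)
    for name, (single, sets) in toy.items()
}

for name, (x, y, below) in points.items():
    print(f"{name}: single={x:.2f} median_set={y:.3f} below_diagonal={below}")
```

Using the median gives one value per variant even when different indel combinations yield several compensatory sets.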

PTEN is also an intensively studied tumor suppressor gene. Figure 4b shows the HMMvar score of 246 variants versus the median HMMvar score of the corresponding
compensatory indel sets, which shows the same trend as the TP53 variants. This scoring
procedure provides candidate compensatory indel sets, which, when substituted for the
single indel, ameliorate the deleterious effect of that single mutation. For instance, the
deleterious variant c.142delA (COSMIC428080), associated with skin cancer [18], has an HMMvar score of 1.75; however, with compensatory indels, the deleterious effect
can be lessened to an HMMvar score of 1.07. At the same time, the results here demonstrate
the importance of scoring multiple variants together, instead of individually, to
understand their joint effect.