Bayesian models for syndrome- and gene-specific probabilities of novel variant pathogenicity

Ethical approval was not required, as this study employed reanalyses of published
data in the public domain.

For each syndrome (LQTS, BrS and HCM), we identified a set of genes known to harbour
pathogenic variants for that syndrome and, for each gene, derived prior odds that
a rare variant is pathogenic. Next, we identified relevant predictors and used them
to construct Bayesian logistic regression models, trained using a set of well-characterised
rare variants. We now describe these steps in more detail.

Prior odds of pathogenicity

The probability that an observed variant is pathogenic depends on the gene in which
it is located and the syndrome of the patient. In a patient with LQTS for example,
a variant in KCNQ1 (a gene implicated in a large proportion of LQTS cases) has a higher probability
of pathogenicity than a variant in KCNJ5 (which has only been implicated in a handful of cases). Similarly, prior evidence
indicates that a variant in the gene MYH7 has a substantial probability to be pathogenic in an HCM patient, but a low probability
of pathogenicity for LQTS.

The prior odds of pathogenicity for a rare variant in gene G, found in an affected
individual, might be assumed to be:

(3)

where burden means the proportion of individuals who have a variant in G of the type
stated. Because the pathogenic/benign status of variants is often not known with certainty,
we cannot directly estimate either the numerator or denominator of Equation (3). However, it is reasonable to assume that the burden of benign rare variants in
cases is equal to the burden of rare variants in controls, which we estimate from
population sequence data generated by the Exome Sequencing Project 20] (Table 1). Therefore, we could replace the right-hand side of Equation (3) with:

(4)

Table 1. Estimated burden of rare variants in cases and controls, for the calculation of prior
odds of pathogenicity

The burden of rare variants in cases is estimated primarily from literature reports
of the yields of diagnostic genetic testing for each gene and syndrome 21]–24], as detailed in Additional file 1: Table S4 and summarised in Table 1.

In practice (Table 1), we made some adjustments to Equation (4). If any estimated burden of rare variants in the controls was zero, it was replaced
by the minimum non-zero frequency (0.0002). This was done under the assumption that
the minimum frequency observed corresponds to a variant count of 1. An alternative
approach would be to transform frequencies with (x+1)/n where x is the count of variants and n is the number of individuals considered, but this information is not uniformly available.
When the prior odds obtained from Equation (4) were smaller than 0.2, they were replaced by 0.2. Such a truncation was performed
because estimation of the burden of rare variants in cases gets less precise in the
lower range of values, and random fluctuations in estimated control frequencies have
more impact upon the prior odds within this range of rare variant frequencies in cases.
The threshold for truncation of prior odds was set to 0.2, following simulation experiments
demonstrating that predictions were equally accurate when the prior odds were set
at either 1 or 0.2.

We employed an alternative approach to calculating prior odds for five LQTS genes
(KCNQ1, KCNH2, SCN5A, KCNE1 and KCNE2) that is not based on the literature reports given in Additional file 1: Table S4. We obtained independent burden estimates from a published prospective
case series 25] comprising 2,500 unrelated cases referred for LQTS clinical genetic testing.

For the five LQTS genes with case series data, Table 1 shows that the two approaches yield different estimates, as the burden of pathogenic
variants in individuals referred for LQTS genetic testing in the clinical case series
(30%) was lower than that typically reported in research studies (70% to 75%) 21],22]. It is likely that clinicians employ different thresholds for defining cases in a
research setting or subspecialist clinical centre than are employed for day-to-day
referrals to a clinical genetic diagnostics service. For better performance in the
latter, more challenging, setting, we based our rare variant burden estimates on the
case series. Given that these estimates were on average about half those predicted
from the literature when both estimates were available, we reduced by a factor of
two the prior odds derived from literature-based estimates for the other eight LQTS
genes (ANK2, KCNJ2, CACNA1C, CAV3, SCN4B, AKAP9, SNTA1 and KCNJ5).

Prior odds for HCM- and BrS-related genes were estimated using only information from
the literature, in the absence of additional prospective case series for these syndromes,
as summarised in Table 1. Combining literature-derived case burdens with population-derived estimates of control
burdens led to unstable estimates of model parameters, due to the small numbers of
variants. To avoid widely fluctuating priors (with very low confidence), we used an
average estimated benign rare variant burden of 1% for these genes, yielding conservative
priors.

Predictors of pathogenicity

Having established prior odds for individual genes for each syndrome, we next consider
sources of evidence that can be used to obtain the posterior probability of pathogenicity
for an observed variant. First we describe the training data, comprising rare variants
of known pathogenic/benign status, which will be used to estimate model parameters.

Training data

We generated a compendium of previously reported sequence variants in our genes of
interest from HGMD professional version 2011.4, UniProt, dbSNP 135 and published case
series 5],12],25]. From these we selected a subset of variants that could be robustly categorised as
benign or pathogenic, assessed separately for each syndrome (LQTS, BrS and HCM). Variants
were defined as pathogenic if they were reported as causing disease with supporting
evidence of a functional effect in vitro and/or mechanistic information. Variants were defined as benign if identified in
prospectively ascertained cohorts of healthy individuals 25], or if present in any dbSNP population with a frequency 0.01. Variants not satisfying
either of these criteria, and hence assessed to have intermediate probability of pathogenicity,
were removed. In total, 320 variants were retained in the LQTS training set (164 pathogenic
and 156 benign), 73 variants in the BrS set (17 pathogenic and 56 benign), and 95
variants in HCM-related genes (67 pathogenic and 28 benign). These comprised 428 missense
variants, 41 radical and 19 inframe indels. Variants are shown in Tables S1,S2,S3
(Additional file 1).

For HCM genes, we also used a second, independently ascertained training set. Pathogenic
variants were taken from the training set previously applied to the PolyPhen-HCM classifier
12], supplemented with further pathogenic variants manually curated to the same robust
criteria, and additional benign variants identified in a well-phenotyped control cohort
6]. These sources yielded 18 pathogenic and 39 benign substitution variants.

Protein domains

The pathogenicity of a coding variant depends on its location within the protein.
Clustering of rare coding variants in hotspots has been described for a number of
proteins 26]. This may simply represent high local mutability and/or lack of constraint, if variants
cluster in both cases and controls. Alternatively, the region may be intolerant of
variation, and an increase in the proportion of variants that are pathogenic may be
informative for the interpretation of subsequent variants 5],16].

Therefore, we included domain-specific terms, with uninformative prior distributions,
in the linear predictors for inframe and missense variants. Each domain effect is
assumed to be normally distributed, with mean zero and variance an inverse-gamma random
variable. For each gene, protein domain annotations were obtained from Uniprot 27] using the ‘Regions’ fields of the Sequence Annotation section. We explored informative
prior distributions for domain terms; for example, estimates of the probability of
pathogenicity of variants found in distinct domains of KCNQ1, KCNH2 and SCN5A in 5] were considered. However, informative priors did not improve predictions, probably
because these correlated with domain effects estimated de novo from our LQTS training data, which included many of the variants used by Kapa et al. to derive their estimates. No domain term is included for the prediction of variants
in genes KCNJ2, CAV3, SCN4B, AKAP9, SNTA1, KCNJ5, CACNA2D1, CACNB2, GPD1L, SCN1B, MYBPC3 and TNNT2, because domains of these genes are not sufficiently represented in our LQTS, BrS
and HCM training data.

Allele frequencies in controls

Rare monogenic diseases cannot be attributed to variants that are common, though such
variants may contribute to the risk of common complex diseases, or act as modifiers
of rare disease. However, pathogenic variants may exist at a low frequency in populations
of ostensibly healthy controls, as these may be in a pre-clinical disease phase, or
phenotyped using methods with incomplete sensitivity. We explored the use of variant
frequency as a quantitative predictor, but found that the best performance was obtained
using a binary indicator of whether the variant has been recorded in a control database.
As control sets improve over time, this decision may need to be revisited.

Control data sets were taken from the 1000 Genomes project and the Exome Sequencing
Project 20],28]. The 1000 Genomes frequency was obtained from phase 1 of the project (2,184 haploid
exomes) and comprised data from 14 distinct populations including European (five),
African (three), East Asian (three) and Latin American (three) populations. The Exome
Sequencing Project frequency was derived from approximately 13,000 haploid exomes
from European-Americans and African-Americans in a 2:1 ratio. Published variants from
a further control cohort of seemingly healthy controls and volunteers 5] comprising 1,300 individuals (47% Caucasian, 26% African American, 11% Hispanic,
10% Asian and 6% unknown/other), were used for the three major LQTS genes (KCNQ1, KCNH2 and SCN5A). For our LQTS training data, 40% of the benign variants had a non-zero frequency
in the controls, in contrast to 2% of pathogenic variants (typically only one occurrence).

Conservation

Where a protein residue is conserved across a wide range of species, this is widely
interpreted as evidence that the residue is functionally important and hence intolerant
of variation. We used protein sequence alignments of Ensembl-defined one-to-one orthologues
for up to 70 species, and for each variant we recorded the species in which the corresponding
residue was conserved. Due to the variable quality in genome sequencing and the automated
nature of the orthologue alignments, a residue was deemed as conserved if it aligned
with the same amino acid and non-conserved if aligned with a different amino acid;
for residues aligning with gaps in the alignment or X (indicating poor sequence quality),
the data for that species were disregarded. Residues were initially classified into
five non-overlapping categories indicating whether they were conserved in all 70 species,
all vertebrates, (eutherian) mammals, primates or were not conserved. During model
fitting, this was simplified to three categories of conservation, encoded as two binary
indicators (conserved in primates and conserved in all species) with not conserved
as the reference state.

Non-synonymous single nucleotide polymorphism algorithms

Several algorithms aim to predict the pathogenicity of missense substitutions, largely
based on physicochemical properties of the reference and alternate amino acids and
their sequence contexts. Three of these have been incorporated as predictors in our
model.

The Grantham score 29], first described in 1974, is a measure of the physicochemical similarity of pairs
of amino acids. It is not dependent on sequence context. Polymorphism phenotyping
v2 (PolyPhen) aims to predict the impact of single amino acid substitutions using
sequence, phylogenetic and structural information 30]. It is available in two flavours, trained on different data sets. HumDiv aims to
identify alleles that may alter protein function, while HumVar aims to discriminate
variants with large effects (sufficient to cause Mendelian disease) from alleles that
are at most only mildly deleterious. Our model uses the HumVar classifier. The sorting
intolerant from tolerant (SIFT) algorithm also predicts whether an amino acid substitution
affects protein function, based on the degree of conservation of amino acid residues
in sequence alignments derived from closely related sequences 31]. Importantly, SIFT does not predict whether an alteration in protein function will
be sufficient to cause disease.

The distributions of Grantham, PolyPhen and SIFT scores obtained for pathogenic and
benign variants in the LQTS training set are shown in Figure 1. The PolyPhen and SIFT scores both lie between 0 and 1, but with opposite polarities
(higher PolyPhen scores and lower SIFT scores both indicate greater risk of pathogenicity).
Grantham scores range up to 205 in the training set, with higher values indicating
greater risk of pathogenicity. To facilitate interpretation of model coefficients
in our model, Grantham scores were rescaled to lie between 0 and 1 by dividing by
205, and SIFT was replaced by 1?SIFT.

Figure 1. Comparison of the distributions of predictor variables for benign and pathogenic variants. Histograms depict the numbers of variants that are pathogenic (magenta bars) and
benign (black bars) with predictor values within a range as indicated on the x-axes for the four predictor variables. Conservation categories are defined in the
Methods section.

Constructing the Bayesian prediction model

Our Bayesian logistic regression model is represented diagrammatically in Figure 2, and in BUGS format in Additional file 2. For each variant, the logistic transform of the probability for it to be pathogenic
is calculated as a sum over contributions from the predictor variables described above.
The contributions from quantitative predictors are either linear or quadratic.

Figure 2. Graphical representation of the three prediction models for a single syndrome. The logistic regression models are represented as the three rectangles on the right,
for a radical variant (top), an inframe indel (middle) and a missense substitution
(bottom). Ellipses describe model predictors. Each model is additive on a logistic
scale. Multiple arrows emerging from an ellipse indicate that the parameter is shared
across the models indicated by the destinations of the arrows. This diagram represents
the model for one syndrome.

Our model proposes interrelated linear predictors for each of the three classes of
variant, chosen because of background information indicating different probabilities
of pathogenicity associated with each class. While the majority of cases of inherited
cardiac conditions are caused by missense substitutions, there is also an appreciable
burden of rare missense variants in healthy controls 5],25]. By contrast, radical variants are almost absent from the general population, so
that for many inherited cardiac conditions such a variant in a patient is conventionally
considered pathogenic 5]. Inframe indels have previously been classified as radical, because they are also
rare in healthy controls 5], but we have classified these separately because of the evidence for different properties
of this class 25]. The likely effects of different variant classes also depends on the molecular mechanism
of disease. A missense variant may lead to either loss or gain of function, whereas
radical variants usually produce loss of function. The probability of pathogenicity
of radical variants should depend on whether disease is mediated by loss or gain of
function of the gene of interest. Non-coding variants and near splice site substitutions
that do not disrupt the consensus donor/acceptor sites are not considered in the model.

The model for radical variants is the simplest, including only two predictors (variant
frequency indicator and gene effect). The intercept represents the average pathogenicity
of radical variants (Figure 2) and it, like other intercept parameters introduced below, is assigned a N(0,10) prior distribution. Domain terms are not included because the fraction of radical
variants that are pathogenic is similar across domains 5], and loss of function is expected to be largely independent of the precise location
of the truncating variant. Conversely, gene effects are important for the pathogenicity
of radical variants due to the differences between genes in the effects of gain-of-function
and loss-of-function variants, among other factors.

The model for inframe indels is similar to that for radical variants, but also includes
domain-specific terms. Domain terms have normal prior distribution centred at 0 (its
variance is estimated from the data and has an inverse-gamma prior distribution).
We use the same gene effects as for missense variants. For interpretability of the
parameters, we also include the intercept that measures average pathogenicity over
all non-radical variants, and we add to this the inframe effect term, which measures
any excess pathogenic risk of inframe variants over missense substitutions.

The richest model is for the missense substitutions, which includes all the predictors
mentioned above plus Grantham, SIFT, PolyPhen and the two conservation indicators,
for primates and all species. For the three quantitative predictor variables, we considered
various models in a series of simulation experiments and found the best performance
with SIFT modelled as a linear covariate, but Grantham and PolyPhen scores both included
as quadratic functions. The linear part of each model is assigned a N(0,10) prior distribution.

Our model uses several hyperparameters. One hyperparameter determines the standard
deviation of a normal prior. Two hyperparameters specify a gamma distribution that
models domain effects. We have tried experimentally varying the values of these parameters.
When we replaced standard deviations of normal distributions by larger values, for
example 100, we found that the fitted effect sizes of predictors became large, leading
to overfitting. In contrast, smaller standard deviations shrank the effect sizes too
much, thus preventing confident predictions for many variants. The parameters of the
gamma distribution had a less noticeable effect on the results; in the end we chose
parameters that are consistent with our expectations of the magnitude of domain effects
and that prohibit overfitting, which is a potential concern with nearly 60 domains
of LQTS genes.

Model fitting and assessment

All model parameters are fitted simultaneously over the three variant classes to allow
simultaneous estimation from the training data set of the shared parameters indicated
by multiple arrows in Figure 2. Estimation for LQTS and BrS models was also done jointly for combined estimation
of the shared parameters (variance of domain effects, scale of gene priors, intercepts
for all variant classes and parameters modelling the effect of Grantham, SIFT, PolyPhen
and conservation). The HCM model was fitted separately and also in combination with
the LQTS model (the shared parameters were a subset of those shared for LQTS and BrS:
variance of domain effects and parameters modelling the effect of Grantham, SIFT,
PolyPhen and conservation).

The fitting was done by a Markov chain Monte Carlo (MCMC) procedure within the JAGS
software 32]. Ten chains, each with 40,000 iterations, were simulated to estimate the effect sizes.
The first 20% of outputs were discarded as burn-in. The convergence of simulations
was assessed by a potential scale reduction estimator 33] applied to each parameter separately:

where

while B is a variance in sampled parameter values between ten chains, W is a variance in sampled parameter values within a single chain and n=32, 000 is the number of non-discarded simulations. We required R1.1 to terminate, which was always satisfied after 40,000 iterations.

The post burn-in simulations were used to report parameter medians, which were subsequently
used as fitted values in a prediction model for novel variants. Posterior probability
distributions for the model parameters can be estimated from the MCMC output, which
is useful for model diagnosis. Goodness of fit for all models was assessed via cross-validation.
We randomly split the data set 100 times into training and test sets such that each
test set had 1/10 of pathogenic and 1/10 of benign variants. We fitted the model (Figure
2) to each training set and then calculated the pathogenic risk for the test variants.

Although the model outputs a posterior probability of pathogenicity that is continuous
and innately interpretable, we also assess model performance by applying a threshold
to probabilities of pathogenicity and calculating standard metrics including sensitivity,
specificity and positive predictive value (PPV). To ensure that the PPV appropriately
reflects the prevalence of variants in the intended test population (rather than the
proportion of benign and pathogenic variants in the cross-validation test set), we
calculate PPV using the equation:

(5)

where the prior odds are derived for each syndrome as described above and summarised
in Table 1. Recall that the true positive rate is synonymous with sensitivity, and:

A model description in BUGS format, a table of fitted model coefficients, and scripts
necessary to reproduce these analyses are provided in Additional files 2, 3, 4.