Identification of genomic features in the classification of loss- and gain-of-function mutation

Tagged mutations from the literature

We obtained 14,259 gene-sentence relationships for GoF and 29,586 relationships for
LoF. After removing genes that had no associated sentences, 2,142 sentences remained
for GoF and 4,600 sentences for LoF. Next, tmVar [15] identified 474 mutations in the
2,142 GoF sentences and 816 mutations in the 4,600 LoF sentences.

Overlapping genes

First, from the literature, we obtained mutation information for 816 LoF and 474 GoF
mutations. Next, during the data-preprocessing step, we removed mutations whose reference
alleles did not match the reference genome and mutations published before 2009, which
left 258 LoF mutations and 129 GoF mutations. We then extracted the gene names from
these mutations and selected 109 LoF genes and 59 GoF genes for further analysis.
Figure 2 shows that 15 genes were common to both classes. However, because the gene
names are broadly distributed, they show no pattern that can be used to classify LoF
and GoF mutations.
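
The preprocessing filter and the gene overlap described above can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the record fields (`gene`, `ref`, `genome_ref`, `year`), the helper names, and the toy records with placeholder gene names are all hypothetical.

```python
# Hypothetical record layout: each mutation is a dict holding the gene name,
# the reference allele reported in the literature ("ref"), the allele found
# in the reference genome at that position ("genome_ref"), and the
# publication year ("year").

def keep_mutation(record, min_year=2009):
    """Keep a mutation only if its reference allele matches the reference
    genome and it was not published before min_year."""
    return record["ref"] == record["genome_ref"] and record["year"] >= min_year


def gene_set(records, min_year=2009):
    """Collect the distinct gene names of the mutations that pass the filter."""
    return {r["gene"] for r in records if keep_mutation(r, min_year)}


# Toy data with placeholder gene names (not taken from the study).
lof_records = [
    {"gene": "GENE_A", "ref": "C", "genome_ref": "C", "year": 2011},
    {"gene": "GENE_B", "ref": "G", "genome_ref": "G", "year": 2010},
    {"gene": "GENE_C", "ref": "A", "genome_ref": "T", "year": 2012},  # allele mismatch
]
gof_records = [
    {"gene": "GENE_B", "ref": "T", "genome_ref": "T", "year": 2013},
    {"gene": "GENE_D", "ref": "A", "genome_ref": "A", "year": 2005},  # before 2009
]

lof_genes = gene_set(lof_records)
gof_genes = gene_set(gof_records)
shared_genes = lof_genes & gof_genes  # genes present in both classes
print(sorted(lof_genes), sorted(gof_genes), sorted(shared_genes))
```

With the real data, the same set intersection would give the genes shared by the LoF and GoF lists that Figure 2 reports.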

Figure 2. The number of genes in LoF and GoF mutations. The number of gene lists for LoF and GoF mutations

Subcellular location

The subcellular location information of the proteins was collected from the UniProt
database [22]. We then analyzed the enriched subcellular locations of the LoF and GoF
mutant genes using a hypergeometric test against the background of all 22,119 subcellular
locations recorded for human proteins. Figure 3A shows the calculated distributions of
the subcellular locations of the LoF and GoF genes across the nucleus, cell membrane,
cytoplasm, membrane, and secreted categories. The subcellular location containing the
highest number of LoF mutated genes is the nucleus (19.61%), whereas for GoF it is the
cell membrane (23.44%). When we applied the hypergeometric test to compare the LoF
distribution with the human background, and the GoF distribution with the human
background, we found that the LoF mutations differed significantly from the background
in the cell membrane and cytoplasm (P-value = 0.0159 and 0.0180, respectively). The
GoF mutations differed significantly from the background in the nucleus, cell membrane,
and membrane (P-value = 0.0356, 0.0001, and 0.0254, respectively).
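
As an illustration of this kind of enrichment test, the sketch below computes hypergeometric tail probabilities with the Python standard library. The text does not state whether a one-sided or two-sided test was applied, so both tails are shown. The function names and every example count other than the background size of 22,119 are assumptions made for illustration.

```python
from math import comb


def hypergeom_pmf(i, M, n, N):
    """P(X = i) when N items are drawn without replacement from a population
    of M items, n of which belong to the category of interest."""
    return comb(n, i) * comb(M - n, N - i) / comb(M, N)


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k): probability of seeing at least k category members."""
    return sum(hypergeom_pmf(i, M, n, N) for i in range(k, min(n, N) + 1))


def hypergeom_lower_tail(k, M, n, N):
    """P(X <= k): probability of seeing at most k category members."""
    return sum(hypergeom_pmf(i, M, n, N) for i in range(0, min(k, n, N) + 1))


background_total = 22119        # background size stated in the text
background_in_location = 3000   # hypothetical: background entries in one location
sample_total = 102              # hypothetical: location entries for one mutation class
sample_in_location = 20         # hypothetical: of those, entries in that location

p_over = hypergeom_upper_tail(sample_in_location, background_total,
                              background_in_location, sample_total)
p_under = hypergeom_lower_tail(sample_in_location, background_total,
                               background_in_location, sample_total)
print(f"over-representation p = {p_over:.4f}, under-representation p = {p_under:.4f}")
```

Exact integer binomial coefficients keep the calculation stable even for a background of this size; a statistics library would give the same tail probabilities.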

Figure 3. Subcellular location distribution and mutation subtype distribution. A. Distribution of the subcellular locations in the LoF, GoF, and human classes.
The Y axis shows the percentage of each subcellular location in each class. B. Mutation
compositions in the LoF, GoF, and human classes. The Y axis shows the percentage
of each mutation type in each class.

Mutation subtypes

Next, we extracted mutation subtypes from the LoF and GoF mutations and compared their
distributions. In this work, we used six types of mutations: missense, nonsense, deletion,
indel, duplication, and frameshift. Figure 3B shows the distribution of the mutation
subtypes of LoF and GoF. MacArthur et al. studied LoF variants but did not focus on
missense mutations [4]. In our data, however, the missense mutation is the most frequent
type for both LoF and GoF mutations, which indicates that missense mutations account
for an important proportion of the mutations that affect protein function. The second
most frequent type is the nonsense mutation in LoF and the deletion mutation in GoF.
This suggests that nonsense mutations usually produce a protein that loses its function
rather than one that gains a new function.

Reference and substituted allele ratio

We extracted the nucleotide reference alleles and substituted alleles and classified
the allele pairs into two groups based on their nitrogenous bases. If the reference allele
and the substituted allele belonged to the same class of nitrogenous base (purine and
purine, or pyrimidine and pyrimidine), the mutation was classified as a transition (Ti).
Otherwise, if the two bases belonged to different classes, it was classified as a
transversion (Tv). We then analyzed the differences in the proportion of each allele
pair between the LoF and GoF mutations. Figure 4 shows the ratios of the reference and
substituted allele pairs, and Table 2 lists the P-values obtained by comparing the LoF
and GoF mutations with a binomial test of proportions. No significant difference was
found in the Ti/Tv ratio between LoF and GoF, but for LoF the transition (Ti) percentage
is higher than the transversion (Tv) percentage. In addition, the AG, CT, AT, and
GT pairs show significant differences (P-values: 0.0347, 0.0376, 0.0426, and 0.0399,
respectively).
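
The transition and transversion rule above can be written down directly, as in the sketch below. The comparison of proportions is included only as an illustration: the source does not specify the exact binomial procedure, so the pooled two-proportion z-test used here is an assumption, as are the function names and the example counts.

```python
from math import erfc, sqrt

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'Ti' if both bases are purines or both are pyrimidines,
    and 'Tv' if one is a purine and the other a pyrimidine."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "Ti" if same_class else "Tv"


def titv_counts(pairs):
    """Count transitions and transversions in (ref, alt) pairs."""
    counts = {"Ti": 0, "Tv": 0}
    for ref, alt in pairs:
        counts[classify_substitution(ref, alt)] += 1
    return counts


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both samples are all-success or all-failure
    z = (x1 / n1 - x2 / n2) / se
    return erfc(abs(z) / sqrt(2))


# Hypothetical substitutions, written as (reference allele, substituted allele).
example_pairs = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("G", "A")]
print(titv_counts(example_pairs))

# Hypothetical counts: an allele pair seen in 40 of 258 LoF and 10 of 129 GoF mutations.
print(two_proportion_p_value(40, 258, 10, 129))
```

Applying `two_proportion_p_value` to each allele pair in turn would produce one P-value per pair, in the manner of Table 2.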

Figure 4. Allele. A. Reference to substituted allele rates and transition and transversion (Ti, Tv)
rates in LoF. B. Reference to substituted allele rates and transition and transversion
(Ti, Tv) rates in GoF.

Table 2. Allele P-value.

Protein domain

We collected the protein domain function information of each mutation. Figure 5 shows the distribution of the protein domain functions of the LoF and GoF mutations.
The protein domain functions are broadly distributed, but several of them occur
exclusively in one of the two classes. This result suggests that LoF and GoF mutations
tend to affect different protein functions. Next, we analyzed the protein domain functions
of the LoF and GoF mutations using a hypergeometric test against 55,931 human protein
domain functions.

Figure 5. Protein Domain. LoF and GoF rates in protein domains

Mutation impact

The FIS method was used to estimate the significance of the effects of missense mutations
on protein function, classifying mutations into four grades based on the estimated
scores: high, medium, low, and neutral [13]. Mutations assigned a higher grade have more of an effect on protein
function than those assigned a lower grade. Figure 6 shows the distributions of the functional impact grades of the LoF and GoF mutations.
The two classes differ in their percentages of high-impact and low-impact mutations.
The LoF results show a higher percentage of high-impact mutations than the GoF results
(LoF: 24.49%, GoF: 14.81%) and a lower percentage of low-impact mutations (LoF:
20.41%, GoF: 29.63%). This result suggests that LoF mutations tend to affect protein
function more than GoF mutations do.

Figure 6. Mutation Impact. Distribution of the functional impact of missense mutations. The functional impact
is an estimate of how much the mutation affects the protein function. The Y axis shows
the percentage of each functional impact grade in each class.

Classification of LoF versus GoF with selected features

To confirm whether these properties can be used as criteria for distinguishing
LoF and GoF mutations, we trained support vector machine, random forest, and linear
logistic regression classifiers on 50 data sets, each containing equal numbers of LoF
and GoF mutations to avoid bias. Five-fold cross-validation repeated 100 times gave
25,000 results for each classifier (50 data sets × 100 repetitions × 5 folds), from
which we calculated the average total accuracy, true positive rate (rate of LoF
correctly classified as LoF), and true negative rate (rate of GoF correctly classified
as GoF). Figure 7A shows the accuracy, the true positive rates (sensitivity), and the
true negative rates (specificity), while Figure 7B shows the AUC values. The average
percentage correctly classified was 71.28% for the support vector machine method,
72.23% for the random forest method, and 70.19% for the linear logistic regression
method, while the corresponding AUC values were 0.7128, 0.7880, and 0.7646, respectively.
These results support the discriminative power of the six features.
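
The evaluation protocol can be sketched with the standard library alone, leaving out the classifiers themselves. The sketch assumes that each balanced data set is built by randomly undersampling the larger class, which the text does not state, and all function names are hypothetical. It shows only how 50 data sets, 100 repetitions, and five folds produce 25,000 train/test evaluations, and how the three reported rates are defined.

```python
import random


def balanced_subsample(lof, gof, rng):
    """Draw equally many LoF and GoF mutations (size of the smaller class)."""
    n = min(len(lof), len(gof))
    return rng.sample(lof, n), rng.sample(gof, n)


def kfold_splits(items, k, rng):
    """Shuffle the items and yield k (train, test) partitions."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def iter_protocol(lof, gof, n_datasets=50, n_repeats=100, k=5, seed=0):
    """Yield every (train, test) split of the repeated k-fold protocol."""
    rng = random.Random(seed)
    for _ in range(n_datasets):
        lof_s, gof_s = balanced_subsample(lof, gof, rng)
        labelled = [(m, "LoF") for m in lof_s] + [(m, "GoF") for m in gof_s]
        for _ in range(n_repeats):
            yield from kfold_splits(labelled, k, rng)


def summarize_confusion(tp, fn, tn, fp):
    """Accuracy, sensitivity (LoF recovered), and specificity (GoF recovered),
    treating LoF as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Placeholder identifiers standing in for the 258 LoF and 129 GoF mutations.
lof = [f"lof_{i}" for i in range(258)]
gof = [f"gof_{i}" for i in range(129)]
n_evaluations = sum(1 for _ in iter_protocol(lof, gof))
print(n_evaluations)  # 50 data sets x 100 repetitions x 5 folds = 25000
```

In a full implementation, each yielded split would be used to fit a classifier on `train` and score it on `test`, with `summarize_confusion` applied to the resulting counts before averaging.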

Figure 7. Classification Result. A. Accuracy, sensitivity, and specificity rates of the three classifiers used here:
linear logistic regression, random forest (RF), and support vector machine (SVM).
B. AUC values of the three classifiers.