CoagVDb: a comprehensive database for coagulation factors and their associated SAPs


Search for gene symbol

CoagVDb provides access to genomic information related to coagulation factors. This
information consists of the HGNC gene name, gene symbol, gene ID, source organism, taxonomy
identifier, chromosome number, chromosome location, chromosome sequence (NC), and a link
to the NCBI Map Viewer. Gene-related information present in other databases such as OMIM,
Ensembl, and UniProtKB is also interlinked, which allows the user to search by Ensembl
gene ID, HUGO gene name or HGNC gene ID, Entrez Gene ID, and OMIM gene ID. Gene
information such as the DNA sequence and chromosomal location was obtained from NCBI.
Epigenetic information such as histone modifications, chemical changes in DNA, chromatin
accessibility, gene expression and small RNA expression is available in the epigenomics
section of the NCBI database. To enrich the available information on the gene sequence,
CoagVDb integrates the NCBI Map Viewer through hyperlinks. The database also forms a
simple network with other databases and information sources through links that open in
a new tab. Literature information on the genes encoding the factors, the properties of
the proteins involved in the mechanism, and the consequences of variations that result in
disease phenotypes was obtained from published articles available in PubMed, OMIM and
UniProtKB and is referenced by PubMed ID. The databases integrated with CoagVDb include
HGNC, Entrez, Ensembl, UCSC, OMIM, UniProtKB, UniGene, RefSeq, and KEGG.

Search by variant name

To display the records of 3187 SAPs, we designed a user-friendly web interface in CoagVDb.
To increase the accuracy of SAP annotation, we first collected SAP-related information
from dbSNP and matched it against the variant information from UniProt. Each SAP is
listed with reference links to its rsID/variant, amino acid position, allele change,
contig position, and protein IDs (the FASTA sequence identification number (NP) and the
UniProt sequence number). The amino acid change in each variant is represented by the
wild-type residue, the position, and the new residue after mutation, using single-letter
amino acid codes (WT+POS+NEW).
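
The WT+POS+NEW notation is straightforward to handle programmatically. The following is a minimal Python sketch, assuming a label is written as the single-letter wild-type residue, the 1-based position, and the new residue concatenated together (a label of the form `R506Q`). The names `Variant` and `parse_variant` are illustrative and are not part of CoagVDb.

```python
import re
from typing import NamedTuple

# The 20 standard amino acids in single-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed label layout: wild-type residue, 1-based position, new residue.
_VARIANT_PATTERN = re.compile(rf"^([{AMINO_ACIDS}])([1-9]\d*)([{AMINO_ACIDS}])$")


class Variant(NamedTuple):
    wild_type: str
    position: int
    new_residue: str


def parse_variant(label: str) -> Variant:
    """Split a label such as 'R506Q' into (wild type, position, new residue)."""
    match = _VARIANT_PATTERN.match(label.strip().upper())
    if match is None:
        raise ValueError(f"not a WT+POS+NEW label: {label!r}")
    wild_type, position, new_residue = match.groups()
    return Variant(wild_type, int(position), new_residue)
```

Rejecting labels that do not match the pattern, rather than guessing, keeps malformed entries from being silently attached to the wrong residue.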

Protein information

Information related to the protein can be accessed in two ways: as sequence information
and as structure information.

Sequence information

This section of each database entry contains information about the protein sequence,
integrating the results of sequence analysis from various computational methods. It
includes sequence length, amino acid composition, solvent accessibility, secondary
structural elements, ordered/disordered regions, cysteine residue locations, disulfide
bond formation and conservation score, provided in the form of a table.

Amino acid composition

Amino acid composition analysis of a sequence can provide direct information about the
functional mutation sites of the protein. Recent studies have explored the occurrence
of various amino acids, along with their biophysical characteristics, in the native and
mutant states of numerous proteins [23]–[25]. We calculated the composition of each amino
acid in the corresponding coagulation factor protein sequence using Statistical Analysis
of Protein Sequences (SAPS) [26]. For this analysis, we submitted each protein's FASTA
sequence as the input file.
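
As an illustration of what a composition analysis reports, the sketch below computes the percentage of each residue from a FASTA record. It is a simple stand-in written for this explanation, not the SAPS program itself, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def read_fasta_sequence(fasta_text: str) -> str:
    """Return the residues of a single-record FASTA string, header removed."""
    lines = (line.strip() for line in fasta_text.splitlines())
    return "".join(line for line in lines if line and not line.startswith(">")).upper()


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Map each residue found in the sequence to its percentage of the total length."""
    if not sequence:
        raise ValueError("empty sequence")
    counts = Counter(sequence)
    total = len(sequence)
    return {residue: 100.0 * n / total for residue, n in sorted(counts.items())}
```

For a toy record whose sequence is `AACCGGGG`, the function returns 25% A, 25% C and 50% G.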

Secondary structure analysis and solvent accessibility

We analyzed the occurrence, location and distribution of secondary structural elements:
α-helices, β-strands, turns, and bends. The distribution of amino acids among these
elements was examined because they are the essential structural components of protein
scaffolds. The secondary structure and solvent-accessible area of each amino acid in the
protein sequence were calculated using NetSurfP ver. 1.1 [27]. The secondary structure
elements are represented as H: Alpha-helix; G: 3-10-Helix; I: Pi-helix (extremely rare);
E: Extended strand; B: Beta-bridge; T: Turn; S: Bend; and C: The Rest. The
solvent-accessible area of each amino acid is classified as buried or exposed, shown in
red and black, respectively.
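
The one-letter legend above maps naturally onto a lookup table. The sketch below, an illustration only, counts how many residues fall into each class given a per-residue string of these codes; the function name and the input string format are assumptions made for this example.

```python
from collections import Counter
from typing import Dict

# One-letter secondary structure codes as listed in the text.
SECONDARY_STRUCTURE_CODES = {
    "H": "Alpha-helix",
    "G": "3-10-Helix",
    "I": "Pi-helix",
    "E": "Extended strand",
    "B": "Beta-bridge",
    "T": "Turn",
    "S": "Bend",
    "C": "The Rest",
}


def summarize_secondary_structure(codes: str) -> Dict[str, int]:
    """Count residues per secondary structure class from a per-residue code string."""
    unknown = set(codes) - set(SECONDARY_STRUCTURE_CODES)
    if unknown:
        raise ValueError(f"unknown secondary structure codes: {sorted(unknown)}")
    return {SECONDARY_STRUCTURE_CODES[c]: n for c, n in Counter(codes).items()}
```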

Disordered residues

A disordered region in a protein sequence is characterized by enrichment in polar and
charged amino acids and a low percentage of hydrophobic amino acids [28]. DISpro [29] was
used to predict the probability of each amino acid residue being ordered or disordered.
Residues are designated as O (ordered) or D (disordered) in the output file.

Cysteine residues and disulfide bonds

Studies have highlighted the importance of Cys residues and disulfide bonds in protein
folding [30]. A residue change to or from Cys is likely to destabilize a protein
structure. Taking this into consideration, we extended our analysis of sequence
information by applying DIpro [31] to predict disulfide bonds and estimate the number of
disulfide bonds in a given protein sequence.
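
The two sequence-level facts used in this section, where the cysteines are and whether a substitution gains or loses one, can be checked directly. The following minimal sketch does only that; it does not predict disulfide bonds as DIpro does, and the function names are hypothetical.

```python
from typing import List


def cysteine_positions(sequence: str) -> List[int]:
    """Return the 1-based positions of all cysteine residues."""
    return [i for i, residue in enumerate(sequence.upper(), start=1) if residue == "C"]


def max_disulfide_bonds(sequence: str) -> int:
    """Upper bound on disulfide bonds: each bond pairs two cysteines."""
    return len(cysteine_positions(sequence)) // 2


def changes_cysteine(wild_type: str, new_residue: str) -> bool:
    """True when a substitution introduces or removes a cysteine."""
    return (wild_type.upper() == "C") != (new_residue.upper() == "C")
```

The bound in `max_disulfide_bonds` is purely combinatorial; how many bonds actually form depends on the structure.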

Sequence conservation

Disease-causing SAPs often reside in highly conserved positions. Assessment of non-neutral
SAPs is primarily based on phylogenetic information (i.e. correlation with residue
conservation) extended to an individual scale with structural approaches. A multiple
sequence alignment of homologous sequences reveals the positions at which amino acids
have been conserved throughout evolutionary time. These positions can be critical for
protein function [32]. We first searched the protein sequence of each coagulation factor
against a sequence database to find sequences of homologous proteins, and built multiple
sequence alignments (MSAs) of sequences from several vertebrate species, including human,
using the web-based tool MUSCLE (multiple sequence comparison by log-expectation) [33].
The importance of a residue for maintaining the structure and function of a protein can
usually be inferred from its conservation pattern. ConSurf [34] quantifies the degree of
conservation at each aligned position to represent localized evolution. This server
provides evolutionary conservation profiles for protein or nucleic acid sequences or
structures by first identifying the conserved positions using an MSA and then calculating
the evolutionary conservation rate using empirical Bayesian inference.
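
To make the idea of per-position conservation concrete, the sketch below scores each column of an alignment by its Shannon entropy, where 0 means fully conserved and higher values mean more variable. This is a deliberately simpler measure than ConSurf's empirical Bayesian rate and is not a substitute for it; the function name and the use of `-` for gaps are assumptions.

```python
import math
from collections import Counter
from typing import List


def column_entropies(aligned: List[str], gap: str = "-") -> List[float]:
    """Shannon entropy (bits) of each alignment column, ignoring gap characters."""
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    entropies = []
    for column in zip(*aligned):
        residues = [r for r in column if r != gap]
        if not residues:
            # A column of gaps only carries no residue information.
            entropies.append(float("nan"))
            continue
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in Counter(residues).values()
        )
        entropies.append(abs(entropy))  # abs() only normalizes -0.0 to 0.0
    return entropies
```

For the toy alignment `["ACD", "ACE", "ACF"]`, the first two columns are identical in every sequence and score 0, while the third has three different residues and scores log2(3) bits.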

Structure information

Data on the available three-dimensional (3D) structure coordinates of coagulation
factor proteins are listed in this section of the database. Structures determined
experimentally by either X-ray or NMR were obtained from the Protein Data Bank (PDB) [35].
In addition, we incorporated the 3D structure resolution, chain type, and amino acid
residue position information.

Prediction tools

We predicted the functional effect of each SAP as pathogenic/deleterious or neutral/tolerated
using the computational prediction methods SIFT [15], PolyPhen-2 [16], I-Mutant 3 [17],
fathmm [18], Align GVGD [19], PhD-SNP [20], SNPs&GO [21] and SNAP [22]. These methods use
different input features to make their predictions, but the ultimate goal of each is to
discriminate deleterious or functional SAPs from neutral ones. As input we submitted the
GenInfo Identifier (GI) number, FASTA sequence or Swiss-Prot protein code, together with
the substitution position (sequence residue number), the native or wild-type residue and
the new residue after mutation, both in single-letter amino acid code (WT+POS+NEW).
Integrating the prediction scores of sequence-based methods (SIFT, PhD-SNP, Align GVGD,
and fathmm) with those of methods that combine sequence and/or structure information
(PolyPhen-2, SNAP, SNPs&GO and I-Mutant 3) may provide wider coverage and more accurate
predictions in the study of SAPs. The methods used derive their information from a
multiple sequence alignment of homologous sequences, which indicates the extent of
conservation; the alignment is either generated internally (SIFT, SNPs&GO) or submitted
by the user (PolyPhen-2, Align GVGD). Detailed information regarding the prediction scores
of the eight computational methods is given in Additional file 1: Table S2. We introduced
a ranking scheme to prioritize variants by the number of the eight methods that designate
them as ‘deleterious’. Variants/rsIDs predicted as deleterious by all 8 tools are ranked
‘1’, those predicted as deleterious by 6–7 tools are ranked ‘2’, by 4–5 tools ‘3’, by
2–3 tools ‘4’, and by 0–1 tools ‘5’ (Figure 1).
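
The ranking scheme reduces to a small lookup on the number of tools that call a variant deleterious. The sketch below encodes the five bands given in the text, assuming each tool's output has already been reduced to a yes/no "deleterious" call (the per-tool score thresholds are in Additional file 1: Table S2 and are not reproduced here). The function names and the dictionary input format are hypothetical.

```python
from typing import Mapping

# The eight prediction methods named in the text.
TOOLS = (
    "SIFT", "PolyPhen-2", "I-Mutant 3", "fathmm",
    "Align GVGD", "PhD-SNP", "SNPs&GO", "SNAP",
)


def rank_from_count(n_deleterious: int) -> int:
    """Map the number of 'deleterious' calls (0-8) to a rank from 1 (highest) to 5."""
    if not 0 <= n_deleterious <= len(TOOLS):
        raise ValueError(f"expected 0-{len(TOOLS)} calls, got {n_deleterious}")
    if n_deleterious == 8:
        return 1
    if n_deleterious >= 6:
        return 2
    if n_deleterious >= 4:
        return 3
    if n_deleterious >= 2:
        return 4
    return 5


def rank_variant(calls: Mapping[str, bool]) -> int:
    """Rank a variant from per-tool booleans (True means 'deleterious')."""
    missing = [tool for tool in TOOLS if tool not in calls]
    if missing:
        raise ValueError(f"missing predictions for: {missing}")
    return rank_from_count(sum(1 for tool in TOOLS if calls[tool]))
```

Requiring a call from all eight tools avoids ranking a variant on partial evidence; how to treat variants for which a tool gives no prediction is a separate design decision that the text does not specify.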

Figure 1. Coagulation variation database (CoagVDb) construction.

Web interface

The freely accessible CoagVDb allows users to perform a ‘quick search’ using a keyword
(gene symbol or variant/rsID) in the left navigation bar. Additional modules cover the
genes, diseases, diagnosis, FASTA sequences, downloads, and site map for the 29
coagulation proteins. The Gene button gives the user direct links to information on the
29 genes. The Disease button gives access to information such as disease name, inheritance
pattern, OMIM ID, disease classification (primary and secondary hemostasis) and occurrence
(most common, less frequent and extremely rare). The Diagnosis button provides information
about the preliminary screening protocol, the laboratory evaluation of coagulation
disorders of the common and multiple pathways, and the diagnosis of coagulation disorders
using preliminary screening tests. The Download button allows the user to download the
gene, protein, variant, and tool prediction information for the 29 coagulation proteins
in .xls format. Lastly, the site map provides an overview of the database and helps the
user navigate it. In the main web interface section ‘home’, we provide a diagram of the
coagulation cascade that illustrates the involvement of the various factors in the
intrinsic, extrinsic and common pathways. The active forms of the coagulation factors are
shown as grey oval buttons with hyperlinks, whereas orange, violet, and green oval buttons
represent the inactive forms of factors in the intrinsic, extrinsic and common pathways.
The “search” button in the main taskbar allows the user to select gene, variant, sequence,
structure, and prediction tool information. Clicking on an information tab displays the
detailed information page of the corresponding entries, with hyperlinks, in the browser.
To make it easy to jump between sections, we provide all the information regarding a gene
(variant, sequence, structure, and prediction tool information) on the same page. In
addition, the ‘Resource’ module allows the user to cross-link to other databases (NCBI,
OMIM, HGMD, etc.), computational prediction methods (SIFT, PolyPhen-2, PhD-SNP, etc.) and
related coagulation factor databases (ClotBase [11], Factor IX Mutation Database [36],
etc.) through their corresponding hyperlinks. Lastly, the help button guides the user
through the different search fields using PLAU as an example.

Resources

In the resource section of the navigation bar, we list databases, tool information,
and related links. The Database tab lists the biological database sources available
online along with their hyperlinks. The Tool information tab provides detailed
information about the computational methods employed to classify SAPs. Lastly, the
Related links tab contains information about existing databases related to coagulation
factors.

Comparison to existing databases

To the best of our knowledge, only a few databases related to coagulation factors were
available online during the construction of this database. Among the existing online
databases (ClotBase [11], Factor VIII variation database [12], factor IX variation
database [13], VWFDb [14], and FXI Deficiency Mutation Database [37]), most are centered
on individual coagulation factors. In comparison, ClotBase [11] offers compiled data on
the blood coagulation proteins. Information regarding changes in amino acid sequence,
evolutionarily conserved regions, mutations, and other curated data is available in that
database. The von Willebrand Factor database (VWFdb) is an online database that centers
on von Willebrand disease. It primarily contains sequence variant data and provides
additional resources for understanding the disease association. The Haemophilia A
Mutation, Structure, Test and Resource Site (HAMSTeRS) was initiated in 1996. It contains
information on factor VIII of blood coagulation and extensive data on point mutations,
insertions, and deletions. Data obtained from computational analysis of the mutations and
from structural studies have also been included. It has since moved to the UCL F8 DB
(HADB/HAMSTeRS). CoagMDB is a database that carries information on five serine protease
factors of the blood coagulation pathway. This interactive database covers all five
factors (factor II, factor VII, factor IX, factor X and protein C) and their
corresponding mutational information. The mutations were correlated with experimentally
quantifiable phenotypes with the help of data available on consensus domain structures.
The FXI Deficiency Mutation Database was created to gather the information available on
mutations in the gene sequence of factor XI.

In comparison to the databases mentioned above, CoagVDb extends the available information
in the following ways. First, it offers a simplified platform for viewing all coagulation
factors along with gene/protein and rsID/variant information, and links to HGNC, Entrez,
Ensembl, UCSC, OMIM, UniProtKB, UniGene, RefSeq, and KEGG. Second, we have provided
sequence information (amino acid sequence length, composition, solvent accessibility,
secondary structural elements, ordered/disordered regions, cysteine residue location,
disulfide bond formation and conservation) along with the available 3D structure
information (X-ray or NMR). This feature allows users to access the physicochemical
characteristics of each native and mutant amino acid. Third, we have included
pathogenicity prediction scores for each SAP from various sequence- and structure-based
prediction methods, which allow the user to discriminate deleterious SAPs from neutral
ones within a pool. Lastly, we have applied a ranking scheme to prioritize the functional
SAPs based on the deleterious scores obtained from the computational prediction methods.
This advantage over the existing databases helps to identify and classify the SAPs that
alter the function of coagulation factor proteins. Moreover, the content made available
in CoagVDb is easy for any end user to use and interpret.