PCAN: phenotype consensus analysis to support disease-gene association

Prior knowledge resources (steps 1 and 2)

Two resources were used to link genes to phenotypic abnormalities, based on the clinical symptomatology of the genetic disorders each gene is known to cause (Fig. 2).

https://static-content.springer.com/image/art%3A10.1186%2Fs12859-016-1401-2/MediaObjects/12859_2016_1401_Fig2_HTML.gif
Fig. 2

PCAN prior knowledge resources. Public resources used to link genes to phenotypic abnormalities based on the genetic diseases each gene causes. The HPO phenotype annotation resource (build #1039) was used to link HP terms to OMIM disorders, and ClinVar (version of May 2015) was used to retrieve genes that cause OMIM disorders. Total counts of each distinct entity type in the resultant gene-trait resource are provided.

The Human Phenotype Ontology (HPO) [9] (build #1529) is used to formally describe phenotypes as sets of human phenotype (HP) terms, so that phenotypes can be compared with one another. We only consider HP terms descended from the “Phenotypic abnormality” (HP:0000118) branch of the HPO. The phenotype annotation resource (build #1039) provided by the HPO was used to list the HP terms assigned to each OMIM disorder.
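The restriction to the “Phenotypic abnormality” branch can be illustrated with a minimal sketch. It assumes the ontology is available as a mapping from each term to its direct parents; the `IS_A` dictionary below is a deliberately simplified toy hierarchy (the real HPO has intermediate terms between these nodes), and the function name `descendants` is hypothetical rather than part of PCAN.

```python
from collections import defaultdict, deque

# Simplified child -> parents ("is_a") edges. This is a toy hierarchy for
# illustration only; the real HPO places intermediate terms between these nodes.
IS_A = {
    "HP:0000118": ["HP:0000001"],  # Phenotypic abnormality -> All
    "HP:0000005": ["HP:0000001"],  # Mode of inheritance -> All
    "HP:0000707": ["HP:0000118"],  # Abnormality of the nervous system
    "HP:0001250": ["HP:0000707"],  # Seizure (attached directly here for brevity)
    "HP:0000007": ["HP:0000005"],  # Autosomal recessive inheritance
}


def descendants(root, is_a):
    """Return all terms below `root` (the root itself is excluded)."""
    # Invert the child -> parents map into parent -> children.
    children = defaultdict(set)
    for child, parents in is_a.items():
        for parent in parents:
            children[parent].add(child)

    # Breadth-first traversal down from the root.
    found, queue = set(), deque([root])
    while queue:
        term = queue.popleft()
        for child in children[term]:
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


abnormalities = descendants("HP:0000118", IS_A)
```

In this toy hierarchy, terms under “Mode of inheritance” are excluded because they are not reachable from HP:0000118.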

The ClinVar database [18] (version of May 2015) was used to identify genes (using Entrez Gene [19] identifiers) causally linked to Mendelian diseases. We restricted the analysis to diseases reported in OMIM and to the linked variants that have a pathogenic clinical status and one of the following origins: germline, de novo, inherited, maternal, paternal, biparental or uniparental. In summary, 3,181 human genes were associated with 3,656 diseases to which at least one HP term descended from “Phenotypic abnormality” is related (4,549 associations in total).
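The variant filter described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the record layout (keys `entrez_id`, `disease_id`, `clinical_status`, `origin`), the `OMIM:` identifier prefix, and all identifiers in `RECORDS` are hypothetical and do not reproduce the actual ClinVar file format.

```python
# Variant origins retained in the text above.
ALLOWED_ORIGINS = {
    "germline", "de novo", "inherited", "maternal",
    "paternal", "biparental", "uniparental",
}

# Hypothetical records; field names and identifiers are made up for illustration.
RECORDS = [
    {"entrez_id": 1001, "disease_id": "OMIM:100001",
     "clinical_status": "Pathogenic", "origin": "germline"},
    {"entrez_id": 1001, "disease_id": "OMIM:100001",
     "clinical_status": "Pathogenic", "origin": "de novo"},   # duplicate pair
    {"entrez_id": 1002, "disease_id": "OMIM:100002",
     "clinical_status": "Benign", "origin": "germline"},      # wrong status
    {"entrez_id": 1003, "disease_id": "OMIM:100003",
     "clinical_status": "Pathogenic", "origin": "somatic"},   # wrong origin
    {"entrez_id": 1004, "disease_id": "OTHER:42",
     "clinical_status": "Pathogenic", "origin": "inherited"}, # not an OMIM disease
]


def gene_disease_pairs(records):
    """Collect distinct (gene, disease) pairs supported by a qualifying variant."""
    pairs = set()
    for rec in records:
        if rec["clinical_status"].lower() != "pathogenic":
            continue
        if rec["origin"].lower() not in ALLOWED_ORIGINS:
            continue
        if not rec["disease_id"].startswith("OMIM:"):
            continue
        pairs.add((rec["entrez_id"], rec["disease_id"]))
    return pairs


associations = gene_disease_pairs(RECORDS)
```

Using a set means that several qualifying variants for the same gene and disease count as a single association, which matches the way the association totals are reported.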

Pathway and biological network resources were used to identify mechanistically related genes in order to assess phenotype consensus. Two pathway resources were used to identify genes encoding proteins that function in common signaling pathways: Reactome and Thomson-Reuters’ Metabase. Reactome [17] is a free, open-source, curated and peer-reviewed pathway database. Here we used version 52 to associate 7,580 human genes with 1,345 individual pathways. Metabase (http://thomsonreuters.com/en/products-services/pharma-life-sciences/pharmaceutical-research/metabase.html) is a comprehensive, manually curated database of mammalian biology and medicinal chemistry data. Here we used version 6.20.66604, which includes 6,978 human genes within 1,465 pathways.
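As a rough illustration of how pathway membership yields mechanistically related genes, the sketch below collects every gene that shares at least one pathway with a gene of interest. The pathway names, gene symbols, and the function `pathway_partners` are hypothetical placeholders and are not taken from Reactome, Metabase, or PCAN.

```python
# Hypothetical pathway -> member genes mapping.
PATHWAYS = {
    "pathway_A": {"GENE1", "GENE2", "GENE3"},
    "pathway_B": {"GENE3", "GENE4"},
    "pathway_C": {"GENE5"},
}


def pathway_partners(gene, pathways):
    """Return the genes that share at least one pathway with `gene`."""
    partners = set()
    for members in pathways.values():
        if gene in members:
            partners |= members
    partners.discard(gene)  # a gene is not its own partner
    return partners


related = pathway_partners("GENE3", PATHWAYS)
```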

To identify gene neighbors, two biological network databases were used: the STRING database and, once again, Metabase. STRING [16] is a database of known and computationally predicted protein interactions, including both direct (physical) and indirect (functional) associations. We focused on the 1,249,080 direct interactions involving 17,114 human genes within STRING version 10. STRING also provides a measure of confidence for each interaction as a score ranging from 0 to 1,000. In the following analyses we consider either the whole STRING network or only a high quality (HQ) subnetwork restricted to interactions with a score [20] greater than or equal to 0.5, that is, 500 on the 0 to 1,000 scale (507,298 interactions between 13,712 genes). Additionally, 862,660 interactions involving 23,136 genes were extracted from Metabase. Among these, 238,171 interactions (involving 17,265 genes) are assigned a high trust and form the Metabase high quality (HQ) subnetwork.
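The construction of an HQ subnetwork and the lookup of gene neighbors can be sketched as below. The sketch assumes that scores are stored on the 0 to 1,000 scale, so that the 0.5 cutoff corresponds to 500; the edge list, gene symbols, and function names are hypothetical and serve only to illustrate the filtering step.

```python
# Assumption: scores on the 0 to 1,000 scale, so a 0.5 cutoff corresponds to 500.
HQ_CUTOFF = 500

# Hypothetical undirected interactions: (gene_a, gene_b, confidence score).
EDGES = [
    ("GENE1", "GENE2", 900),
    ("GENE1", "GENE3", 500),  # exactly at the cutoff, so it is kept
    ("GENE2", "GENE4", 320),
    ("GENE3", "GENE4", 499),
]


def high_quality(edges, cutoff=HQ_CUTOFF):
    """Keep interactions whose score is greater than or equal to the cutoff."""
    return [(a, b, score) for a, b, score in edges if score >= cutoff]


def neighbors(gene, edges):
    """Return the genes directly connected to `gene` in an undirected edge list."""
    found = set()
    for a, b, _ in edges:
        if a == gene:
            found.add(b)
        elif b == gene:
            found.add(a)
    return found


hq_edges = high_quality(EDGES)
all_neighbors = neighbors("GENE4", EDGES)
hq_neighbors = neighbors("GENE4", hq_edges)
```

In this toy network GENE4 has two neighbors in the whole network but none in the HQ subnetwork, which shows how the choice of network can change the set of related genes considered.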