Context-based resolution of semantic conflicts in biological pathways

Constructing an integrated biological pathway dataset

We first constructed a large-scale biological pathway set to examine whether conflicting
information exists in current knowledge. The pathway set draws on two
major sources of publicly reported biological pathways: a public canonical
database and the biomedical literature.

Information extraction from public pathway database

We used the KEGG pathway database [5] as the source of canonical pathways. Although KEGG does not cover
all known pathways, it provides highly reliable information through 281 manually
confirmed human pathways that contain directional relationships between biomedical
entities. In addition, many associated tools and libraries enable robust information extraction.
Initially, we extracted relation information from four curated canonical pathway databases:
KEGG, PID, SMPDB, and ReconX. Among the various types of conflict, we focused on
the case where contradictory information (e.g. ‘A increases B’ and ‘A decreases
B’) exists for the same interaction. For ‘react’ or ‘translocate’ relations, it is
not straightforward to decide which relations contradict them, and
most information in the databases other than KEGG consists of ‘react’ or ‘translocate’ relations.
Protein-protein interactions were not used because they
have no direction. We therefore used only the relation information in KEGG. Nonetheless,
other pathway databases (e.g. BioCarta) can be added for additional analysis.

We downloaded KEGG Markup Language (KGML) files of the 281 human pathways from the KEGG pathway
database using the KEGGgraph package in R [6]. KGML files contain a set of records, each of which consists of an entry field and
a relation between two entities. We could further confirm the relational information
using the KEGG ID (e.g. hsa:1432) in the entry field. Each relation has two entities,
entry1 and entry2, where entry1 acts on entry2. Relations are annotated with interaction
types (i.e. gene regulation, protein-protein interaction, and protein-chemical interaction)
and subtypes (i.e. activation, inhibition, phosphorylation, dephosphorylation, glycosylation,
ubiquitination, methylation, indirect effect, missing interaction, expression, repression,
binding/association, dissociation, and reaction).

Based on the KGML file structure, we extracted 41,207 type/subtype-annotated relations
among 5,457 biomedical entities.
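
As an illustration of the KGML structure described above, the following minimal Python sketch reads relations and their subtypes from a KGML document. It is not the KEGGgraph-based procedure used in this study; the sample fragment, the pathway identifier in it, and the function name extract_relations are hypothetical and serve only to show how entry IDs, KEGG IDs, and type/subtype annotations fit together.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written KGML fragment used only for illustration.
SAMPLE_KGML = """<pathway name="path:hsa00000" org="hsa" number="00000">
  <entry id="1" name="hsa:1432" type="gene"/>
  <entry id="2" name="hsa:5594 hsa:5595" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
    <subtype name="phosphorylation" value="+p"/>
  </relation>
</pathway>"""


def extract_relations(kgml_text):
    """Return (source KEGG IDs, target KEGG IDs, type, subtype) tuples."""
    root = ET.fromstring(kgml_text)
    # Map each pathway-local entry id to the KEGG IDs listed in its name attribute.
    id_to_names = {e.get("id"): tuple(e.get("name").split())
                   for e in root.iter("entry")}
    relations = []
    for rel in root.iter("relation"):
        source = id_to_names.get(rel.get("entry1"), ())
        target = id_to_names.get(rel.get("entry2"), ())
        # One output record per subtype, since a relation may carry several.
        for sub in rel.iter("subtype"):
            relations.append((source, target, rel.get("type"), sub.get("name")))
    return relations


for record in extract_relations(SAMPLE_KGML):
    print(record)
```

A relation that carries several subtypes yields one record per subtype, which is one possible way to count type/subtype-annotated relations.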

Information extraction from biomedical literature

Relations among biological entities were extracted from 13,214,710 PubMed abstracts.
To accomplish this, we first tagged entities in the abstracts using a set of existing
Named Entity Recognition (NER) tools; different NER tools were used to tag different
entity types.

We defined eight biomedical entity types: ‘gene’, ‘protein’, ‘cell type’,
‘disease’, ‘organ’, ‘metabolite’, ‘drug’ and ‘biological process’. Based on the reported
performance of NER tools on manually curated corpora [7,8], we selected the best tool for each entity type. By this criterion, four NER tools
were selected: GENIA [9] for tagging ‘gene’, ‘protein’ and ‘cell type’, BANNER [10] for ‘disease’, LingPipe [11] for ‘organ’ and NERSuite [12] for ‘metabolite’, ‘drug’ and ‘biological process’. The choice for each entity type followed the papers describing
the methods (see Table 1). The reported performance of
GENIA was better than that of BANNER, although the two tools were evaluated on different datasets and
therefore cannot be compared directly. We chose GENIA for two further reasons: its running
time was shorter than that of BANNER, and its coverage was wider;
for example, GENIA can also tag cell types.

Table 1. Performance of each tool.

We considered a sentence to contain relational information
when a relational word (e.g. increase, activate or repress) is located between two
biomedical entities (see Table 2). For example, ‘activate’ is located between ‘CYP1A’ and ‘benzo[a]pyrene’ in the following
sentence: “In contrast, CYP1A isoforms can also activate some compounds, such as benzo[a]pyrene to their carcinogenic metabolite, and the induction of these isoforms increases the
risk of carcinogenicity.”

Table 2. Pairs of entities in meta-relations

In this manner, a total of 2,575,846 relations among 53,799 entities were extracted
from the literature.

Mapping relations to meta-relations

To investigate consistency and inconsistency among interactions, we mapped the
collected relationships between entities in the integrated pathways to two meta-relations,
‘increase’ and ‘decrease’. As the term ‘meta-relation’ implies, ‘increase’ and
‘decrease’ are not confined to their literal meanings but include all possible
relations in which one entity affects the other in a positive or negative way. For
instance, ‘increase’ may represent (but is not limited to) 1) an increase in molecular
quantity (e.g. activation of mRNA transcription), 2) activation of gene/protein function
(e.g. signal transduction by phosphorylation) and 3) induction of a phenotype (e.g.
triggering of apoptosis or causing a disease). The ‘decrease’ meta-relation is
defined analogously.

We observed a total of 14 relational terms in the KEGG database: ‘activation’,
‘inhibition’, ‘phosphorylation’, ‘dephosphorylation’, ‘glycosylation’, ‘ubiquitination’,
‘methylation’, ‘indirect effect’, ‘missing interaction’, ‘expression’, ‘repression’,
‘binding/association’, ‘dissociation’, and ‘reaction’. Among the 14, five relations
(‘activation’, ‘phosphorylation’, ‘glycosylation’, ‘methylation’ and ‘expression’)
were mapped to the ‘increase’ meta-relation. On the other hand, three relations
(‘inhibition’, ‘dephosphorylation’ and ‘repression’) were mapped to the
‘decrease’ meta-relation. The remaining six terms could not be mapped because of their
ambiguous functional directionality.
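
The mapping of the 14 KEGG subtypes can be written down as a small lookup, sketched here in Python. The set contents follow the text above; the names INCREASE, DECREASE, UNMAPPED and to_meta_relation are illustrative and not part of any KEGG tool.

```python
# KEGG subtypes grouped by meta-relation, as listed in the text.
INCREASE = {"activation", "phosphorylation", "glycosylation",
            "methylation", "expression"}
DECREASE = {"inhibition", "dephosphorylation", "repression"}
# Subtypes left unmapped because their direction is ambiguous.
UNMAPPED = {"ubiquitination", "indirect effect", "missing interaction",
            "binding/association", "dissociation", "reaction"}


def to_meta_relation(subtype):
    """Return 'increase', 'decrease', or None for an unmapped subtype."""
    name = subtype.strip().lower()
    if name in INCREASE:
        return "increase"
    if name in DECREASE:
        return "decrease"
    return None


print(to_meta_relation("phosphorylation"))   # increase
print(to_meta_relation("repression"))        # decrease
print(to_meta_relation("dissociation"))      # None
```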

In the literature-derived pathways, we found 64 distinct relations (see Table 3). We mapped 36 relations to ‘increase’ and 28 relations to ‘decrease’.

Table 3. Mapping table

We considered the form of each sentence when assigning the extracted meta-relations to their
corresponding entities. When a sentence was written in the active form (e.g. “A increases
B”), we regarded the former entity as the acting one and the latter as its target.
Conversely, we switched the roles of the two entities when a sentence was written
in the passive form (e.g. “A is increased by B”). We used pattern-based relation extraction
to determine the sentence form.
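
The role assignment for active and passive sentences can be illustrated with a deliberately simple Python sketch. The actual extraction patterns are not specified here, so the regular expression, the short stem list and the function name extract_interaction are all assumptions made for illustration; the sketch also assumes that the two entities have already been tagged.

```python
import re

# Illustrative verb stems only (an assumption); a real vocabulary would be larger.
STEM_TO_META = {
    "increas": "increase",
    "activat": "increase",
    "decreas": "decrease",
    "repress": "decrease",
}

# Crude passive-voice cue: a form of "to be" followed later by "by".
PASSIVE = re.compile(r"\b(?:is|are|was|were|been|being)\b.*\bby\b")


def extract_interaction(sentence, first, second):
    """Return (acting entity, target entity, meta-relation) or None.

    `first` must be the entity that appears earlier in the sentence.
    """
    start = sentence.find(first)
    if start < 0:
        return None
    end = sentence.find(second, start + len(first))
    if end < 0:
        return None
    # Only the text between the two entities is inspected.
    between = sentence[start + len(first):end].lower()
    meta = next((m for stem, m in STEM_TO_META.items() if stem in between), None)
    if meta is None:
        return None
    if PASSIVE.search(between):
        return (second, first, meta)   # passive form: swap the roles
    return (first, second, meta)       # active form: keep the order


print(extract_interaction("GeneA increases GeneB", "GeneA", "GeneB"))
print(extract_interaction("GeneA is repressed by GeneB", "GeneA", "GeneB"))
```

In the passive example the returned acting entity is GeneB, mirroring the role switch described above.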

Extracting context information

Definition of context

To identify relations that are valid only under a specific condition, we defined and
considered four types of context: organ (OG), cell type (CT), disease (DS) and drug
(DR).

Organ and cell-type contexts provide the location where the corresponding relation
arises. For instance, the statement ‘A increases B in liver (OG)’ constrains the relation’s
effect range to the liver. Cell type is defined similarly. Disease and drug contexts provide
the clinical effect range of the corresponding relation with respect to phenotypic
and environmental conditions, respectively. For instance, ‘A decreases B in lung adenocarcinoma
(DS)’ or ‘A increases B in (or given) Herceptin (DR)’ can be interpreted in a similar way.

To build the ontology of the four contexts, we obtained 1,191 organ, 526 cell-type
and 4,620 disease terms from the MeSH (Medical Subject Headings) database [13] (Table 4). For the drug context, we used 6,825 terms from the DrugBank database [14]. In total, 13,162 context terms were prepared for subsequent context information extraction.

Table 4. Statistics of four contexts and their resources

Extraction of context information

Based on the 13,162 context terms, we extracted the context information
associated with each relation in the integrated biological pathway set.
Because the canonical pathway database does not contain any information
other than entity-relation sets, context information could be attached only
to literature-derived pathways.

We first searched for the defined context terms across the entire set of abstracts.
When a context term c was found in an abstract L, we annotated all the relations extracted
from L with c. When multiple context terms were found, we took the following approach. If the
terms belong to distinct context types (OG, CT, DS and/or DR), we annotated the corresponding
relations with the entire context set. If there are two or more context terms of
the same type, we separated them into single contexts to generate new relations,
each of which is annotated with one context. In other words, context terms of different
types are connected by an AND-like operation (e.g. ‘A increases B in liver under Aspirin’), whereas context terms
of the same type are OR-like (e.g. ‘A decreases B in liver or lung’). This strategy is consistent with the
rationale that two contexts of the same type either cannot be satisfied simultaneously (e.g. no
organ is both liver and stomach) or are rarely combined in the literature (e.g. ‘stomach
cancer and breast cancer’ usually denotes two separate conditions).
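
One way to read the AND/OR strategy above is as a Cartesian product over context types: each type contributes its list of terms (or a single empty value), and every combination yields one annotated relation. The Python sketch below follows that reading; the function name expand_contexts and the use of None for a missing context are assumptions made for illustration.

```python
from itertools import product

# Order of the context fields appended to each relation.
CONTEXT_TYPES = ("CT", "OG", "DS", "DR")


def expand_contexts(relation, found):
    """Annotate one relation with the context terms found in its abstract.

    relation: (left entity, right entity, meta-relation)
    found:    dict mapping a context type to the list of terms found
    Returns one 7-field tuple per combination of same-type alternatives.
    """
    # A type with no term contributes a single None (no context).
    options = [found.get(ctype) or [None] for ctype in CONTEXT_TYPES]
    # Different types are combined (AND-like); terms of one type are
    # split into separate relations (OR-like).
    return [relation + combo for combo in product(*options)]


annotated = expand_contexts(("A", "B", "increase"),
                            {"OG": ["liver", "lung"], "DR": ["Aspirin"]})
for item in annotated:
    print(item)
```

Here two organ terms and one drug term produce two annotated relations, each carrying one organ together with the drug.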

Identifying conflicting information

Based on the integrated biological pathway set with context information, we searched
for all conflicting (or contradictory) information that the set contains.

Basic definitions

We define an interaction i as a set of two entities and the associated meta-relation:

i = {LEFT-ENTITY, RIGHT-ENTITY, META-RELATION}

where the LEFT-ENTITY is a biomedical entity (e.g. a gene or protein) that plays the acting role in the relation,
the RIGHT-ENTITY is the target biomedical entity, and the META-RELATION is the relation between LEFT-ENTITY and RIGHT-ENTITY, mapped to either the ‘increase’ or the ‘decrease’ relationship defined previously (see
Methods 2.2). By this definition, one interaction corresponds to one edge in the context-free
biological pathway.

We define a rule r as a set of two entities, the meta-relation between them and the associated context
information:

r = {LEFT-ENTITY, RIGHT-ENTITY, META-RELATION, CT, OG, DS, DR}

where CT, OG, DS and DR are sets of cell-type, organ, disease and drug context term(s), respectively. A context
term field is ‘null’ when no information was extracted. To exemplify the rule, two sample rules are shown
below, where r₁ contains three context terms (CT, OG and DS) and r₂ has no context information.

r₁ = {protein X, gene Y, increase, epithelial cell, small intestine, T2D, ‘null’}

r₂ = {protein A, gene B, decrease, ‘null’, ‘null’, ‘null’, ‘null’}

By definition, each interaction has a set of supporting rules (at least one rule). We define the rule group of an interaction i, R(i), as the set of rules that share the same LEFT-ENTITY and RIGHT-ENTITY:

R(i) = {r | r[1]=i[1], r[2]=i[2]}

where r[1], r[2], i[1] and i[2] denote the LEFT-ENTITY and the RIGHT-ENTITY of the rule r and the interaction i (elements are referred to by their position in the set). The rule group is
further divided into two subgroups by the direction of the meta-relation. The increase rule group R(i)⁺ and the decrease rule group R(i)⁻ of an interaction i are defined as:

R(i)⁺ = {r | r[1]=i[1], r[2]=i[2], r[3]=‘increase’}

R(i)⁻ = {r | r[1]=i[1], r[2]=i[2], r[3]=‘decrease’}

where r[3] is the META-RELATION of the rule r.

Judgment of conflict

For an interaction i and its supporting rule group R(i), we generally expect the entire set R(i) to be uni-directed, meaning that the observed relations between the two entities should
be consistent. The coexistence of contradictory information (e.g. ‘A increases
B’ and ‘A decreases B’) is therefore exceptional and needs to be resolved for proper downstream
pathway analysis.

We say that an interaction i has conflicting information when its supporting rule group contains both increase and decrease rules.
Using the previous definitions, this condition can be written as:

N(R(i)⁺) > 0 and N(R(i)⁻) > 0

Here N(·) denotes the number of rules in a group. The number of conflicts in i is calculated as the total number of rules, N(R(i)⁺) + N(R(i)⁻).
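
The conflict test can be sketched in Python by counting increase and decrease rules per entity pair. This is a minimal illustration under the definitions above, not the implementation used in this study; the function name find_conflicts and the tuple layout (left entity, right entity, meta-relation, ...) are assumptions.

```python
from collections import defaultdict


def find_conflicts(rules):
    """Return {(left, right): (n_increase, n_decrease)} for conflicting pairs.

    rules: iterable of tuples whose first three fields are
           (left entity, right entity, meta-relation).
    """
    counts = defaultdict(lambda: [0, 0])
    for rule in rules:
        left, right, meta = rule[0], rule[1], rule[2]
        # Index 0 counts 'increase' rules, index 1 counts 'decrease' rules.
        counts[(left, right)][{"increase": 0, "decrease": 1}[meta]] += 1
    # A pair conflicts when both rule groups are non-empty.
    return {pair: (inc, dec) for pair, (inc, dec) in counts.items()
            if inc > 0 and dec > 0}


rules = [
    ("A", "B", "increase"),
    ("A", "B", "decrease"),
    ("A", "B", "increase"),
    ("C", "D", "increase"),
]
for pair, (inc, dec) in find_conflicts(rules).items():
    print(pair, "conflicts:", inc + dec)
```

In this toy input only the pair (A, B) is reported, with two increase rules and one decrease rule, giving three conflicts.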

Judgment of context-dependent conflict

Using the context information, the effect range of a relation can be constrained
(see Methods 2.3). Thus, a conflict between two rules that have non-overlapping effect
ranges can be resolved. For instance, the rules ‘A increases B in liver’ and ‘A decreases
B in adipocyte’ can be non-contradictory. Based on the annotated context, we further
refine the definition of conflicting information so that a context-dependent conflict arises only between rules that have identical context information. Formally, a rule subgroup of an interaction i, R(i, OG, CT, DS, DR), is defined as the subset of R(i) whose rules have the corresponding context:

R(i, OG, CT, DS, DR) = {r | r[1]=i[1], r[2]=i[2], r[4]=CT, r[5]=OG, r[6]=DS, r[7]=DR}

The number of possible subgroups is the number of combinations of terms across the four contexts.

Similarly, an increase rule subgroup and a decrease rule subgroup can be defined:

R(i, OG, CT, DS, DR)⁺ = {r | r[1]=i[1], r[2]=i[2], r[3]=‘increase’, r[4]=CT, r[5]=OG, r[6]=DS, r[7]=DR}

R(i, OG, CT, DS, DR)⁻ = {r | r[1]=i[1], r[2]=i[2], r[3]=‘decrease’, r[4]=CT, r[5]=OG, r[6]=DS, r[7]=DR}

Therefore, a context-dependent conflict arises when an increase rule subgroup and the corresponding decrease rule subgroup are both non-empty.
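
The context-dependent test can be sketched in Python by grouping rules on the entity pair together with all four context fields, so that only rules with identical context are compared. The function name find_context_conflicts and the use of None for an empty context field are assumptions made for illustration.

```python
from collections import defaultdict


def find_context_conflicts(rules):
    """Return the (left, right, CT, OG, DS, DR) keys that hold a conflict.

    rules: iterable of 7-field tuples
           (left, right, meta-relation, CT, OG, DS, DR).
    """
    directions = defaultdict(set)
    for left, right, meta, ct, og, ds, dr in rules:
        # Rules fall into the same subgroup only if all contexts match.
        directions[(left, right, ct, og, ds, dr)].add(meta)
    return [key for key, metas in directions.items()
            if {"increase", "decrease"} <= metas]


rules = [
    ("A", "B", "increase", None, "liver", None, None),
    ("A", "B", "decrease", "adipocyte", None, None, None),
    ("A", "B", "decrease", None, "liver", None, None),
]
print(find_context_conflicts(rules))
```

The liver rules conflict with each other, whereas the adipocyte rule sits in its own subgroup and is not reported.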

Balance analysis among conflicting rules

We call two rule groups R(i)⁺ and R(i)⁻, or two rule subgroups R(i, OG, CT, DS, DR)⁺ and R(i, OG, CT, DS, DR)⁻, competitive if they conflict. The balance between two conflicting rule groups can be
inferred from the number of supporting rules. We call a rule group strong if it contains more rules than its competing group, and weak otherwise.

To see whether a conflict is dominated by one rule group, we define the minor rule frequency (MRF) as the ratio of the size of the weak rule group to the total size of both rule groups:

MRF = min(N(R(i)⁺), N(R(i)⁻)) / (N(R(i)⁺) + N(R(i)⁻))

where min(A, B) is the smaller of A and B. An MRF near 0.5 indicates that the two rule groups are supported to a similar degree. A very small
MRF may imply that the weak rule group resulted from erroneous literature information or
mining procedures. We call a conflict dominating if MRF < 0.25 and balanced otherwise.
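
As a worked illustration of the balance analysis, the following Python sketch computes the MRF from the two group sizes and labels the conflict. It takes the threshold as a strict inequality (MRF < 0.25), which is an assumption about the intended cut-off; the function names are illustrative.

```python
def minor_rule_frequency(n_increase, n_decrease):
    """Size of the smaller rule group divided by the total number of rules."""
    if n_increase <= 0 or n_decrease <= 0:
        raise ValueError("MRF is defined only when both rule groups are non-empty")
    return min(n_increase, n_decrease) / (n_increase + n_decrease)


def classify_conflict(n_increase, n_decrease, threshold=0.25):
    """Label a conflict 'dominating' (MRF below threshold) or 'balanced'."""
    mrf = minor_rule_frequency(n_increase, n_decrease)
    return "dominating" if mrf < threshold else "balanced"


# Nine increase rules against one decrease rule: MRF = 1 / 10.
print(minor_rule_frequency(9, 1), classify_conflict(9, 1))
# Three rules on each side: MRF = 3 / 6.
print(minor_rule_frequency(3, 3), classify_conflict(3, 3))
```

Because the smaller group can hold at most half of the rules, the MRF always lies in the interval (0, 0.5].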