Entity linking for biomedical literature

In this section we present our EL approach for the biomedical domain. A more detailed description of this system, with applications to other domains, can be found in [26].

Overview

In the discussion that follows, we first define some basic concepts, notation, and preliminary background, and then give an overview of the EL system. The entity mentions m ∈ M are the prominent phrases in the full text of a scientific paper. We consider all classes, properties, and individuals described in the ontologies e ∈ E to be the reference entities, which are used to ground the entity mentions. Each entity is described by a surface form dictionary that contains all phrases matching its string. For example, the entity “IKK” is an entry in E, whereas an occurrence of “IKK” in a scientific paper is an entity mention. Furthermore, an occurrence of “IκB kinase” is one surface form of “IKK” because it is a synonym of “IKK”.

The overall approach is depicted in Figure 1. We first construct a knowledge base (described in the following section). Next, given a textual document d, we extract the entity mentions M = {m1, m2, … mn} as described in section 3.2. We then construct a graph representation Gd = ⟨V, R⟩ for d, where V = {v1, v2, … vn} is the set of vertices, each vertex v represents an entity mention in d, and R = {r1, r2, … rn} is the set of edges. (Note: Gd refers to the graph of document d, whereas Gk refers to the graph of the knowledge base.) Two vertices v1 and v2 are connected by an edge (v1, v2, r) if and only if the entity mentions for v1 and v2 are related to each other. Here, such a relation is obtained by analyzing the document d. For this work, we extract relations based on sentence-level or paragraph-level co-occurrence. Then, for each entity mention m, we use the surface form dictionary to locate a list of candidate entities c ∈ C and compute an importance score using the non-collective approach detailed in section 3.5. Finally, we compute a similarity score for each entity mention/candidate entity pair ⟨m, c⟩ and select the candidate with the highest score as the appropriate entity for linking.
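As a concrete sketch, sentence-level co-occurrence edges for the document graph Gd can be built as follows (the helper name and input format are ours, not from the paper; the real system also supports paragraph-level co-occurrence):

```python
from itertools import combinations

def build_document_graph(sentences):
    """sentences: list of lists of entity mentions, one list per sentence.
    Returns the vertex set V and edge set R of the document graph Gd."""
    vertices, edges = set(), set()
    for mentions in sentences:
        vertices.update(mentions)
        # Sentence-level co-occurrence: connect every pair of mentions
        # appearing in the same sentence.
        for m1, m2 in combinations(sorted(set(mentions)), 2):
            edges.add((m1, m2, "co-occurs"))
    return vertices, edges

V, R = build_document_graph([["IKK", "phosphorylation"], ["IKK", "cytoplasm"]])
```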

Figure 1. Approach Overview. An illustration of the approach described for analysis of a document.

Knowledge Base graph construction

We utilize a very broad definition of a Knowledge Base (KB). A Knowledge Base is a
data set that contains some, potentially limited, structured content along with unstructured
content.

Using this broad definition, Wikipedia is a popular knowledge base that is often used
for entity linking because it contains structured information such as titles, hyperlinks,
infoboxes as well as unstructured texts. However, in order to take advantage of richer
structures and domain knowledge which are not offered by Wikipedia, we constructed
a knowledge base from 300 biology-related ontologies from BioPortal [6]. Based on the rich structure contained in these ontologies, we created a web of data (WOD). In the WOD, each entity e is described as a set of triples t ∈ T. For example, the triple (_:Nucleus, _:PartOf, _:CellComponent) indicates that the entity “nucleus” is “part of” the entity “cell component”.

Our expanded knowledge base E was constructed using a graph-based approach. E consists of classes, individuals, and properties in the aggregated ontologies. Each
entity e is regarded as a vertex in the knowledge graph Gk. Using our WOD, each entity is connected to other entities via its set of triples T. These connections are regarded as the edges of Gk. For example, the entities “phosphorylating”, “IKK”, and “IκB kinase activity” contained in the Gene Ontology [27] are treated as vertices of our graph. The triples (_:IκB kinase activity, _:subClassOf, _:phosphorylating) and (_:IκB kinase activity, _:relatedTo, _:IKK) are treated as edges between the vertex “IκB kinase activity” and other vertices of our graph.
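A minimal sketch of turning ontology triples into an adjacency structure for Gk (the triples shown are illustrative and the undirected treatment of edges is an assumption):

```python
from collections import defaultdict

def build_kb_graph(triples):
    """triples: iterable of (subject, property, object) strings.
    Returns an adjacency map: vertex -> list of (property, neighbor)."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
        graph[o].append((p, s))  # record both directions for neighborhood lookups
    return graph

triples = [
    ("IκB kinase activity", "subClassOf", "phosphorylating"),
    ("IκB kinase activity", "relatedTo", "IKK"),
]
gk = build_kb_graph(triples)
```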

Mention extraction

The focus of the paper is to link identified mentions to the concepts in the knowledge
base. Therefore, for identifying prominent mentions from unstructured texts, we apply
various publicly available natural language processing tools. First, a name tagger [28] is used to extract entity mentions. Regular expressions are then used to join named entities that might have been tagged separately, by looking for intervening prepositions, articles, and punctuation marks. Then, a shallow parser [29] is used to add noun phrase chunks to the list of mentions. A parameter controls the
minimum and maximum number of chunks per mention (one and five by default), and whether
overlapping mentions are allowed. Although domain-specific named entity recognition could improve the overall performance of the system, it was not investigated here since our focus was on the entity linking problem.
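The joining step might be sketched as follows. The name tagger and shallow parser are external tools [28,29], so the per-token entity flags below stand in for their output, and the list of intervening words is an assumption:

```python
import re

# Words/punctuation allowed to bridge two adjacent named entities
# (an illustrative list, not the paper's exact set).
INTERVENING = re.compile(r"^(of|in|for|the|a|an|[-,/])$", re.IGNORECASE)

def join_named_entities(tokens, entity_flags):
    """Join adjacent named entities separated only by prepositions,
    articles, or punctuation. entity_flags[i] is True when tokens[i]
    belongs to a tagged named entity."""
    mentions, current = [], []
    for i, tok in enumerate(tokens):
        if entity_flags[i]:
            current.append(tok)
        elif current and INTERVENING.match(tok) and \
                i + 1 < len(tokens) and entity_flags[i + 1]:
            current.append(tok)  # bridge the gap between two entities
        else:
            if current:
                mentions.append(" ".join(current))
                current = []
    if current:
        mentions.append(" ".join(current))
    return mentions

tokens = ["inhibitor", "of", "kappaB", "binds", "IKK"]
flags  = [True, False, True, False, True]
print(join_named_entities(tokens, flags))  # ['inhibitor of kappaB', 'IKK']
```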

Entity candidate retrieval

By analyzing the triples describing the entities, we also construct a surface form
dictionary (f, {e1, e2, … ek}), where {e1, e2, … ek} is the set of entities with surface form f. We analyzed the following main properties: labels and names (e.g. rdfs:label), synonyms
(e.g. exact synonym from gene ontology), aliases, and symbols (e.g. from Orphanet
ontology), providing us with more than 150 properties to construct the surface form
dictionary. During the candidate retrieval process, we retrieve all entities with surface forms similar to the mention's surface form and consider them as candidates for that mention.
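A sketch of the dictionary construction step (the property names and entity identifiers below are illustrative stand-ins for the 150+ properties actually analyzed):

```python
from collections import defaultdict

# Properties whose values are treated as surface forms (illustrative subset).
SURFACE_PROPERTIES = {"rdfs:label", "oboInOwl:hasExactSynonym", "alias", "symbol"}

def build_surface_form_dict(triples):
    """Map each surface form f to the set of entities {e1, ..., ek}
    described by that form."""
    sf_dict = defaultdict(set)
    for entity, prop, value in triples:
        if prop in SURFACE_PROPERTIES:
            sf_dict[value.lower()].add(entity)
    return sf_dict

triples = [
    ("GO:0008384", "rdfs:label", "IκB kinase activity"),
    ("GO:0008384", "oboInOwl:hasExactSynonym", "IKK"),
    ("PR:000009184", "symbol", "IKK"),
]
sf = build_surface_form_dict(triples)
print(sorted(sf["ikk"]))  # ['GO:0008384', 'PR:000009184']
```

Candidate retrieval then amounts to looking up (or fuzzily matching) a mention's surface form in this dictionary.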

Non-collective entropy rank

The candidate entities retrieved from the knowledge base are pre-ranked using an entropy-based
non-collective approach. The main idea of the algorithm is to assign the entities
with higher popularity a higher score. While entities in Wikipedia are universally
connected with the same type of link, entities in the ontologies are potentially connected
with many kinds of links that may have semantically rich definitions. We can leverage
this greater degree of specificity and assign different weights to edges described
by different properties. For example, consider the triples (_:IKK, _:isCapableOf, _:phosphorylation) and (_:IKK, _:locatedIn, _:cytoplasm). Since “phosphorylation” and “cytoplasm” are connected to “IKK” by different relations, we consider their influence on the importance of “IKK” to be different.

To capture such differences in influence, we compute the entropy of relations H(p) [30] as

H(p) = − Σ_{op ∈ Op} ρ(op) log ρ(op)    (1)

where p ∈ P is a property or relation that has a value op ∈ Op or links to an object op ∈ Op, and ρ(op) is the probability of obtaining op given the property p. The entropy measure has been used in many ranking algorithms to capture the salience of information [31,32]; therefore, in our task, we use it to capture the saliency of a property. In the previous example, p indicates “is capable of” and “located in”, while op indicates “phosphorylation” and “cytoplasm”, respectively. Then H(“isCapableOf”) and H(“locatedIn”) are the influence factors between “IKK” and “phosphorylation”, and between “IKK” and “cytoplasm”, respectively.
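Estimated from the knowledge base triples, the entropy of a property can be computed as a short sketch (the maximum-likelihood probability estimate and log base are assumptions):

```python
import math
from collections import Counter

def property_entropy(object_values):
    """Entropy H(p) of a property p, estimated from the list of objects
    o_p observed for p across the knowledge base (equation 1)."""
    counts = Counter(object_values)
    total = len(object_values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A property whose objects are highly varied carries more information:
print(property_entropy(["a", "b", "c", "d"]))  # 2.0
print(property_entropy(["a", "a", "a", "a"]))  # -0.0 (fully predictable)
```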

We then compute the salience score of candidate entities using the following non-collective
EntropyRank:

ER(c) = Σ_{p ∈ Pc} H(p) · Σ_{e ∈ Ep(c)} ER(e) / N(e)    (2)

where Pc is the set of properties describing a candidate entity c, Ep(c) is the set of entities linked to c via property p, and N(e) is the number of entities linked to e. The EntropyRank of each entity starts at 1 and is recursively updated until convergence.
This equation is similar to PageRank [33], which gives higher ranks to popular entities, but we also take the differences in influence of neighboring nodes into consideration.
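The iterative update can be sketched as follows (the convergence criterion and data layout are assumptions; edge weights are the property entropies H(p)):

```python
def entropy_rank(neighbors, h, tol=1e-6, max_iter=100):
    """neighbors[c] = list of (p, e) pairs linking entity c to entity e;
    h[p] = entropy H(p) of property p. Returns the converged ranks."""
    er = {c: 1.0 for c in neighbors}                 # ER starts at 1
    n_links = {c: len(nbrs) for c, nbrs in neighbors.items()}
    for _ in range(max_iter):
        new = {c: sum(h[p] * er[e] / n_links[e] for p, e in nbrs)
               for c, nbrs in neighbors.items()}
        if max(abs(new[c] - er[c]) for c in er) < tol:
            return new
        er = new
    return er

# Two entities linked symmetrically by one property with H(p) = 1:
ranks = entropy_rank({"A": [("r", "B")], "B": [("r", "A")]}, {"r": 1.0})
print(ranks)  # {'A': 1.0, 'B': 1.0}
```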

As described previously, the candidate entities are retrieved from the surface form
dictionary based on the above salience measure. Most often, the exact surface form
match between an entity mention and a candidate entity cannot be found. However, our
rank model allows partial surface form matches with a penalty. Currently we use Jaccard
Similarity to compute partial match scores. For example, Jaccard Similarity will be
computed for the mention “nucleus” and the entity “neural nucleus”. In the equation below, JS(m, c) is the Jaccard Similarity score between the surface form of entity mention m and the surface form of candidate entity c.

Score(m, c) = JS(m, c) · ER(c)    (3)
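The partial-match penalty can be sketched as token-level Jaccard similarity (the tokenization choice here is an assumption):

```python
def jaccard_similarity(mention, surface_form):
    """Jaccard similarity between the token sets of two surface forms."""
    a = set(mention.lower().split())
    b = set(surface_form.lower().split())
    return len(a & b) / len(a | b)

# The paper's example: "nucleus" vs. "neural nucleus" share one of two tokens.
print(jaccard_similarity("nucleus", "neural nucleus"))  # 0.5
```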

Collective inference

In the non-collective inference approach, each entity mention is analyzed, retrieved,
and ranked individually. Although this approach performs well in many cases, sometimes
incorrect entity mention/entity links are formed due to the lack of context information.
Therefore, we adopt a collective inference approach, which analyzes relations among
multiple entity mentions and ranks the candidates simultaneously. For example, given
a sentence that contains the entity mentions “phosphorylating” and “IKK”, the collective approach will analyze the two mentions simultaneously to determine
the best reference entities.

In Section 3.1, we presented how we construct the document graph Gd. Using the connected Gd and candidate entities retrieved from the non-collective approach, we can compute
the similarity between each entity mention m from Gd and a candidate entity c from Gk. Both m and c are connected to sets of neighbor nodes, which provide important contextual descriptions for m and c, respectively. We then use the following equation to compute the similarity score:

Sim(m, c) = α · JS(m, c) · ER(c) + β · |N(m, c)|    (4)

Here, N(m, c) is the set of neighbors with equivalent surface forms between the Gk subgraph for candidate c and the Gd subgraph for mention m. The parameters α and β are used to adjust the effects of the candidate pre-ranking score and the context information score on the overall similarity score. Based on the optimization results reported by Zheng et al. [26], we empirically set α = 15 and β = 8 for all experiments. The equation captures two important ranking intuitions: (1) the more popular a candidate c is, the higher its rank, as captured by ER; (2) the more similar the Gk subgraph for c and the Gd subgraph for mention m are, the higher the rank given to c, as captured by the latter part of the equation.
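The collective score can be sketched as below, assuming the pre-ranking and context terms are combined linearly with the weights 15 and 8 reported in the text; treat the exact combination as illustrative rather than the authors' verbatim formula:

```python
ALPHA, BETA = 15, 8   # weight values reported in the text

def collective_similarity(js_score, entropy_rank, mention_neighbors,
                          candidate_neighbors):
    """mention_neighbors / candidate_neighbors: surface forms of the
    neighbors of m in Gd and of c in Gk, respectively."""
    # Neighbors with equivalent (case-insensitive) surface forms.
    shared = {s.lower() for s in mention_neighbors} & \
             {s.lower() for s in candidate_neighbors}
    return ALPHA * js_score * entropy_rank + BETA * len(shared)

# A candidate sharing two context neighbors with the mention:
score = collective_similarity(1.0, 0.4,
                              ["nervous tissue", "nerve impulse"],
                              ["Nervous Tissue", "Nerve Impulse", "STAT3"])
print(score)  # 15*1.0*0.4 + 8*2 = 22.0
```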

To better describe the use of this system for the life science domain, we provide
an illustrative example in Figure 2. For the example sentence provided, the document graph Gd has vertices V that correspond to the entity mentions M. For this sentence-level collective inference approach, edges exist between all vertices since these mentions co-occur in the sentence. We then retrieve our knowledge graph Gk from our knowledge base. Focusing our attention on the entity mention “STAT3”, a term-level search returns the candidate “STAT3”. However, because “Activated STAT3” is connected to more vertices of Gk, it is intuitive that this candidate's rank increases under collective inference. Furthermore, although the candidate “Neural Nucleus” is indirectly linked to “Nerve Impulse”, which is in turn linked to the candidate “Nervous Tissue”, the isolation of “Neural Nucleus” from the candidates of other entities enables the candidate entity “Cell Nucleus” to obtain the highest rank.

Figure 2. Illustrative Example. In the document graph, entity mentions are circled. In the knowledge graph, reference
entities are bolded, the candidate entities with the highest ranks are circled with
solid lines, and candidates with lower ranks are circled with dashed lines. Boxes
indicate intermediates.