Entity linking for biomedical literature

In this section we present our EL approach for the biomedical domain. A more detailed description of this system, with applications to other domains, can be found in [26].

Overview

In the discussion that follows, we first define some basic concepts, notation, and preliminary background, and then give an overview of the EL system. The entity mentions m ∈ M are the prominent phrases in the full text of a scientific paper. We consider all classes, properties, and individuals described in the ontologies e ∈ E to be the reference entities, which are used to ground the entity mentions. Each entity is described by a surface form dictionary that contains all phrases matching its string. For example, the entity “IKK” is an entry in E, whereas an occurrence of “IKK” in a scientific paper is an entity mention. Furthermore, an occurrence of “IκB kinase” is one surface form of “IKK” because it is a synonym of “IKK”.

The overall approach is depicted in Figure 1. We first construct a knowledge base (described in the following section). Next, given a textual document d, we extract the entity mentions M = {m1, m2, … mn} as described in section 3.2. We then construct a graph representation Gd = ⟨V, R⟩ for d, where V = {v1, v2, … vn} is the set of vertices, each vertex v represents an entity mention in d, and R = {r1, r2, … rn} is the set of edges. (Note: Gd refers to the graph of document d, whereas Gk refers to the graph of the knowledge base.) Two vertices v1 and v2 are connected by an edge (v1, v2, r) if and only if the entity mentions for v1 and v2 are related to each other. Here, such a relation is obtained by analyzing the document d. For this work, we extract relations based on sentence-level or paragraph-level co-occurrence. Then, for each entity mention m, we use the surface form dictionary to locate a list of candidate entities c ∈ C and compute an importance score using the non-collective approach detailed in section 3.5. Finally, we compute a similarity score for each entity mention/candidate entity pair ⟨m, c⟩ and select the candidate with the highest score as the appropriate entity for linking.
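As a concrete sketch, sentence-level co-occurrence edges for the document graph Gd can be built as follows (the helper name and input format are ours, not from the paper; the real system also supports paragraph-level co-occurrence):

```python
from itertools import combinations

def build_document_graph(sentences):
    """sentences: list of lists of entity mentions, one list per sentence.
    Returns the vertex set V and edge set R of the document graph Gd."""
    vertices, edges = set(), set()
    for mentions in sentences:
        vertices.update(mentions)
        # Sentence-level co-occurrence: connect every pair of mentions
        # appearing in the same sentence.
        for m1, m2 in combinations(sorted(set(mentions)), 2):
            edges.add((m1, m2, "co-occurs"))
    return vertices, edges

V, R = build_document_graph([["IKK", "phosphorylation"], ["IKK", "cytoplasm"]])
```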

Figure 1. Approach Overview. An illustration of the approach described for analysis of a document.

Knowledge Base graph construction

We utilize a very broad definition of a Knowledge Base (KB). A Knowledge Base is a
data set that contains some, potentially limited, structured content along with unstructured
content.

Using this broad definition, Wikipedia is a popular knowledge base that is often used
for entity linking because it contains structured information such as titles, hyperlinks,
infoboxes as well as unstructured texts. However, in order to take advantage of richer
structures and domain knowledge which are not offered by Wikipedia, we constructed
a knowledge base from 300 biology-related ontologies from BioPortal [6]. Based on the rich structure contained in these ontologies, we created a web of data (WOD). In the WOD, each entity e is described as a set of triples t ∈ T. For example, the triple (_:Nucleus, _:PartOf, _:CellComponent) indicates that the entity “nucleus” is “part of” the entity “cell component”.

Our expanded knowledge base E was constructed using a graph-based approach. E consists of classes, individuals, and properties in the aggregated ontologies. Each
entity e is regarded as a vertex in the knowledge graph Gk. Using our WOD, each entity is connected to other entities via its set of triples T. These connections are regarded as the edges of Gk. For example, the entities “phosphorylating”, “IKK”, and “IκB kinase activity” contained in the Gene Ontology [27] are treated as vertices of our graph. The triples (_:IκB kinase activity, _:subClassOf, _:phosphorylating) and (_:IκB kinase activity, _:relatedTo, _:IKK) are treated as edges between the vertex “IκB kinase activity” and other vertices of our graph.
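A minimal sketch of turning ontology triples into an adjacency structure for Gk (the triples shown are illustrative and the undirected treatment of edges is an assumption):

```python
from collections import defaultdict

def build_kb_graph(triples):
    """triples: iterable of (subject, property, object) strings.
    Returns an adjacency map: vertex -> list of (property, neighbor)."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
        graph[o].append((p, s))  # record both directions for neighborhood lookups
    return graph

triples = [
    ("IκB kinase activity", "subClassOf", "phosphorylating"),
    ("IκB kinase activity", "relatedTo", "IKK"),
]
gk = build_kb_graph(triples)
```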

Mention extraction

The focus of the paper is to link identified mentions to the concepts in the knowledge
base. Therefore, for identifying prominent mentions from unstructured texts, we apply
various publicly available natural language processing tools. First, a name tagger [28] is used to extract entity mentions. Regular expressions are then used to join named entities that might have been tagged separately, by looking for intervening prepositions, articles, and punctuation marks. Then, a shallow parser [29] is used to add noun phrase chunks to the list of mentions. A parameter controls the
minimum and maximum number of chunks per mention (one and five by default), and whether
overlapping mentions are allowed. Although domain-specific named entity recognition could improve the overall performance of the system, it was not investigated here since our focus was on the entity linking problem.
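The joining step might be sketched as follows. The name tagger and shallow parser are external tools [28,29], so the per-token entity flags below stand in for their output, and the list of intervening words is an assumption:

```python
import re

# Words/punctuation allowed to bridge two adjacent named entities
# (an illustrative list, not the paper's exact set).
INTERVENING = re.compile(r"^(of|in|for|the|a|an|[-,/])$", re.IGNORECASE)

def join_named_entities(tokens, entity_flags):
    """Join adjacent named entities separated only by prepositions,
    articles, or punctuation. entity_flags[i] is True when tokens[i]
    belongs to a tagged named entity."""
    mentions, current = [], []
    for i, tok in enumerate(tokens):
        if entity_flags[i]:
            current.append(tok)
        elif current and INTERVENING.match(tok) and \
                i + 1 < len(tokens) and entity_flags[i + 1]:
            current.append(tok)  # bridge the gap between two entities
        else:
            if current:
                mentions.append(" ".join(current))
                current = []
    if current:
        mentions.append(" ".join(current))
    return mentions

tokens = ["inhibitor", "of", "kappaB", "binds", "IKK"]
flags  = [True, False, True, False, True]
print(join_named_entities(tokens, flags))  # ['inhibitor of kappaB', 'IKK']
```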

Entity candidate retrieval

By analyzing the triples describing the entities, we also construct a surface form
dictionary (f, {e1, e2, … ek}), where {e1, e2, … ek} is the set of entities with surface form f. We analyzed the following main properties: labels and names (e.g. rdfs:label), synonyms
(e.g. exact synonym from gene ontology), aliases, and symbols (e.g. from Orphanet
ontology), providing us with more than 150 properties to construct the surface form
dictionary. During the candidate retrieval process, we retrieve all entities with surface forms similar to the mention's surface form and consider them as candidates for that mention.
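A sketch of the dictionary construction step (the property names and entity identifiers below are illustrative stand-ins for the 150+ properties actually analyzed):

```python
from collections import defaultdict

# Properties whose values are treated as surface forms (illustrative subset).
SURFACE_PROPERTIES = {"rdfs:label", "oboInOwl:hasExactSynonym", "alias", "symbol"}

def build_surface_form_dict(triples):
    """Map each surface form f to the set of entities {e1, ..., ek}
    described by that form."""
    sf_dict = defaultdict(set)
    for entity, prop, value in triples:
        if prop in SURFACE_PROPERTIES:
            sf_dict[value.lower()].add(entity)
    return sf_dict

triples = [
    ("GO:0008384", "rdfs:label", "IκB kinase activity"),
    ("GO:0008384", "oboInOwl:hasExactSynonym", "IKK"),
    ("PR:000009184", "symbol", "IKK"),
]
sf = build_surface_form_dict(triples)
print(sorted(sf["ikk"]))  # ['GO:0008384', 'PR:000009184']
```

Candidate retrieval then amounts to looking up (or fuzzily matching) a mention's surface form in this dictionary.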

Non-collective entropy rank

The candidate entities retrieved from the knowledge base are pre-ranked using an entropy-based
non-collective approach. The main idea of the algorithm is to assign the entities
with higher popularity a higher score. While entities in Wikipedia are universally
connected with the same type of link, entities in the ontologies are potentially connected
with many kinds of links that may have semantically rich definitions. We can leverage
this greater degree of specificity and assign different weights to edges described
by different properties. For example, consider the triples (_:IKK, _:isCapableOf, _:phosphorylation) and (_:IKK, _:locatedIn, _:cytoplasm). Since “phosphorylation” and “cytoplasm” are connected to “IKK” by different relations, we consider their influence on the importance of “IKK” to be different.

To capture such differences in influence, we compute the entropy of relations H(p) [30] as

H(p) = − Σ_{op ∈ Op} ρ(op) log ρ(op)    (1)

where p ∈ P is a property or relation that has a value op ∈ Op or links to an object op ∈ Op, and ρ(op) is the probability of obtaining op given the property p. The entropy measure has been used in many ranking algorithms to capture the salience of information [31,32]; therefore, in our task, we use it to capture the saliency of a property. In the previous example, p indicates “is capable of” and “located in”, while op indicates “phosphorylation” and “cytoplasm”, respectively. Then H(“isCapableOf”) and H(“locatedIn”) are the influence factors between “IKK” and “phosphorylation”, and between “IKK” and “cytoplasm”, respectively.
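Estimated from the knowledge base triples, the entropy of a property can be computed as a short sketch (the maximum-likelihood probability estimate and log base are assumptions):

```python
import math
from collections import Counter

def property_entropy(object_values):
    """Entropy H(p) of a property p, estimated from the list of objects
    o_p observed for p across the knowledge base (equation 1)."""
    counts = Counter(object_values)
    total = len(object_values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A property whose objects are highly varied carries more information:
print(property_entropy(["a", "b", "c", "d"]))  # 2.0
print(property_entropy(["a", "a", "a", "a"]))  # -0.0 (fully predictable)
```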

We then compute the salience score of candidate entities using the following non-collective
EntropyRank:

ER(c) = Σ_{p ∈ Pc} H(p) · Σ_{e ∈ Ep(c)} ER(e) / N(e)    (2)

where Pc is the set of properties describing a candidate entity c, Ep(c) is the set of entities linked to c via property p, and N(e) is the number of entities linked to e. The EntropyRank of each entity starts at 1 and is recursively updated until convergence.
This equation is similar to PageRank [33], which gives higher ranks to popular entities, but we also take the differences in influence of neighboring nodes into consideration.
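The iterative update can be sketched as follows (the convergence criterion and data layout are assumptions; edge weights are the property entropies H(p)):

```python
def entropy_rank(neighbors, h, tol=1e-6, max_iter=100):
    """neighbors[c] = list of (p, e) pairs linking entity c to entity e;
    h[p] = entropy H(p) of property p. Returns the converged ranks."""
    er = {c: 1.0 for c in neighbors}                 # ER starts at 1
    n_links = {c: len(nbrs) for c, nbrs in neighbors.items()}
    for _ in range(max_iter):
        new = {c: sum(h[p] * er[e] / n_links[e] for p, e in nbrs)
               for c, nbrs in neighbors.items()}
        if max(abs(new[c] - er[c]) for c in er) < tol:
            return new
        er = new
    return er

# Two entities linked symmetrically by one property with H(p) = 1:
ranks = entropy_rank({"A": [("r", "B")], "B": [("r", "A")]}, {"r": 1.0})
print(ranks)  # {'A': 1.0, 'B': 1.0}
```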

As described previously, the candidate entities are retrieved from the surface form
dictionary based on the above salience measure. Most often, the exact surface form
match between an entity mention and a candidate entity cannot be found. However, our
rank model allows partial surface form matches with a penalty. Currently we use Jaccard
Similarity to compute partial match scores. For example, Jaccard Similarity will be
computed for the mention “nucleus” and the entity “neural nucleus”. In the equation below, JS(m, c) is the Jaccard Similarity score between the surface form of entity mention m and the surface form of candidate entity c.

Score(m, c) = JS(m, c) · ER(c)    (3)
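The partial-match penalty can be sketched as token-level Jaccard similarity (the tokenization choice here is an assumption):

```python
def jaccard_similarity(mention, surface_form):
    """Jaccard similarity between the token sets of two surface forms."""
    a = set(mention.lower().split())
    b = set(surface_form.lower().split())
    return len(a & b) / len(a | b)

# The paper's example: "nucleus" vs. "neural nucleus" share one of two tokens.
print(jaccard_similarity("nucleus", "neural nucleus"))  # 0.5
```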

Collective inference

In the non-collective inference approach, each entity mention is analyzed, retrieved,
and ranked individually. Although this approach performs well in many cases, sometimes
incorrect entity mention/entity links are formed due to the lack of context information.
Therefore, we adopt a collective inference approach, which analyzes relations among
multiple entity mentions and ranks the candidates simultaneously. For example, given
a sentence that contains the entity mentions “phosphorylating” and “IKK”, the collective approach will analyze the two mentions simultaneously to determine
the best reference entities.

In Section 3.1, we presented how we construct the document graph Gd. Using the connected Gd and candidate entities retrieved from the non-collective approach, we can compute
the similarity between each entity mention m from Gd and a candidate entity c from Gk. Both m and c are connected to sets of neighbor nodes, which provide important contextual descriptions for m and c, respectively. We then use the following equation to compute the similarity score:

Sim(m, c) = α · JS(m, c) · ER(c) + β · |N(m, c)|    (4)

Here, N(m, c) is the set of neighbors with equivalent surface forms between the Gk subgraph for candidate c and the Gd subgraph for mention m. The parameters α and β are used to adjust the effects of the candidate pre-ranking score and the context information score on the overall similarity score. Based on the optimization results reported by Zheng et al. [26], we empirically set α = 15 and β = 8 for all experiments. The equation captures two important ranking intuitions: (1) the more popular a candidate c is, the higher its rank, as captured by ER; (2) the more similar the Gk subgraph for c and the Gd subgraph for mention m are, the higher the rank given to c, as captured by the latter part of the equation.
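The collective score can be sketched as below, assuming the pre-ranking and context terms are combined linearly with the weights 15 and 8 reported in the text; treat the exact combination as illustrative rather than the authors' verbatim formula:

```python
ALPHA, BETA = 15, 8   # weight values reported in the text

def collective_similarity(js_score, entropy_rank, mention_neighbors,
                          candidate_neighbors):
    """mention_neighbors / candidate_neighbors: surface forms of the
    neighbors of m in Gd and of c in Gk, respectively."""
    # Neighbors with equivalent (case-insensitive) surface forms.
    shared = {s.lower() for s in mention_neighbors} & \
             {s.lower() for s in candidate_neighbors}
    return ALPHA * js_score * entropy_rank + BETA * len(shared)

# A candidate sharing two context neighbors with the mention:
score = collective_similarity(1.0, 0.4,
                              ["nervous tissue", "nerve impulse"],
                              ["Nervous Tissue", "Nerve Impulse", "STAT3"])
print(score)  # 15*1.0*0.4 + 8*2 = 22.0
```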

To better describe the use of this system for the life science domain, we provide
an illustrative example in Figure 2. For the example sentence provided, the document graph Gd has vertices V that correspond to the entity mentions M. For this sentence-level collective inference approach, edges exist between all vertices since these mentions co-occur in the sentence. We then retrieve our knowledge graph Gk from our knowledge base. Focusing our attention on the entity mention “STAT3”, a term-level search returns the candidate “STAT3”. However, because “Activated STAT3” is connected to more vertices of Gk, it is intuitive that this candidate's rank increases under collective inference. Furthermore, although the candidate “Neural Nucleus” is indirectly linked to “Nerve Impulse”, which is in turn linked to the candidate “Nervous Tissue”, the isolation of “Neural Nucleus” from the candidates of other entities enables the candidate entity “Cell Nucleus” to obtain the highest rank.

Figure 2. Illustrative Example. In the document graph, entity mentions are circled. In the knowledge graph, reference
entities are bolded, the candidate entities with the highest ranks are circled with
solid lines, and candidates with lower ranks are circled with dashed lines. Boxes
indicate intermediates.