ODMedit: uniform semantic annotation for data integration in medicine based on a public metadata repository

Personalised medicine is data-intensive [22], and not only with regard to genomic data. In contrast to genomic profiles, little attention has been given so far to the complexity and heterogeneity of patient data, the “phenotype”. An indicator of this complexity is the number of concepts in medical terminologies: 300,000 concepts in SNOMED CT, 2 million concepts in UMLS. The grand challenge of semantic interoperability between medical data sources has been well known for decades [23]. However, UMLS, SNOMED CT and many other medical terminologies, in contrast to classifications like ICD, do not provide uniform coding, i.e., there can be several matching codes for a data element. To some extent, matching is subject to interpretation. For example, height can be coded in UMLS as C0489786 or C0005890. To foster data integration, uniform coding is highly desirable, i.e., the same code for “height” in every information system that is a candidate for data integration. For this purpose, it does not matter whether code C0489786 or C0005890 is chosen, but it should be the same code in every system. For this reason ODMedit is connected to a large metadata repository, and human experts can choose from a list of semantic annotations for potential re-use. Our evaluation demonstrates that several synonymous codes exist for a data element in UMLS (in our case between 9 and 23); therefore uniform annotation, meaning the same code for the same meaning, is a challenging task. Uniform coding was achieved for one data element. Between two and three coding variants were identified for the remaining data elements, which is not yet perfect, but considerably better than a direct UMLS search with hundreds of potential hits. Future work shall address the number and relevance of the concepts proposed by ODMedit. In addition, the interrater reliability of coding with ODMedit shall be assessed.
It has to be taken into account that the final decision about coding is made by human experts, because fully automated coding approaches have limitations [24]. More formal guidance on how to assign uniform codes needs to be developed in the future, which is a considerable effort given the number of terms in UMLS.
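
The re-use idea described above can be illustrated with a minimal sketch. It is not ODMedit's implementation: the repository contents, the usage counts, the placeholder weight code and the function names `normalise` and `suggest_codes` are assumptions made for illustration. Only the two UMLS codes for height come from the text. The sketch ranks previously used codes for a data element by how often they were used, so that an expert can pick the most established one.

```python
from collections import Counter

# Hypothetical excerpt of a metadata repository: (data element label, code)
# pairs from previously annotated forms. The two height codes appear in the
# text; the counts and the weight placeholder are invented for illustration.
REPOSITORY = [
    ("Height", "C0005890"),
    ("height", "C0489786"),
    ("Height ", "C0005890"),
    ("Weight", "PLACEHOLDER-WEIGHT-CODE"),
]


def normalise(label: str) -> str:
    """Lower-case a label and collapse whitespace so trivial variants match."""
    return " ".join(label.lower().split())


def suggest_codes(label, repository):
    """Return (code, usage count) pairs for a label, most frequently used first.

    The function only proposes candidates; the final choice stays with a
    human expert.
    """
    wanted = normalise(label)
    counts = Counter(
        code for name, code in repository if normalise(name) == wanted
    )
    return counts.most_common()


if __name__ == "__main__":
    print(suggest_codes("HEIGHT", REPOSITORY))
    # [('C0005890', 2), ('C0489786', 1)]
```

Exact label matching after normalisation is a deliberately simple choice; a real repository search would probably also need synonyms and multilingual labels.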

The public discussion about patient data is dominated by data protection and privacy issues, which are absolutely important. Perhaps as a side effect of this discussion, the vast majority of patient metadata, and implicitly their semantic annotation, is currently also kept secret [9]. However, this is a roadblock to semantic interoperability: it is simply impossible to integrate patient data between systems and share best practice in medical data structures if the available data items are kept secret.

The second important challenge for patient data integration is semantic annotation, because only data items with the same meaning should be merged. The benefits of UMLS-based semantic annotation for data integration have been described previously [25]. Ideally, semantic annotation should be done at the very beginning by the author of each data item, because he or she knows exactly what is meant by it. However, most patient data sources do not yet provide semantic codes. A first step is semantic annotation of metadata at a later stage with dedicated methods like ODMedit to facilitate data integration. Given the semantic richness of patient data requiring millions of codes, an international collaborative effort is needed to develop and maintain these annotations. UMLS was chosen because it is composed of more than 100 major source vocabularies and therefore outperforms other terminologies regarding overall coverage of concepts. UMLS provides more than 4 million terms, but for some data elements like “Door-to-balloon time” [26] (regarding percutaneous coronary interventions) or “history of ibuprofen” an appropriate code is not yet available. Postcoordination, i.e., the combination of several codes, helps to deal with the semantic richness of patient data, but impedes uniform annotation, because there are different ways to perform postcoordination. The relationships between UMLS terms (UMLS Semantic Network) are currently not taken into account within ODMedit. There is an ongoing debate about the quality of hierarchical structures in major vocabularies from UMLS such as SNOMED CT: “The SNOMED CT hierarchies cannot be relied upon in their present state in our applications.” [27] The goal of uniform semantic annotation is to determine whether two data elements from different sources have the same meaning: yes or no.
When two data elements have similar but not identical meanings, as indicated by different semantic annotations, review by domain experts is useful to assess whether data integration is feasible.
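
The yes/no decision and the expert-review case can be sketched as a comparison of code sets. This is an illustrative assumption, not a rule stated by ODMedit: the function name `compare_annotations`, the result labels, and the heuristic that partially overlapping postcoordinated code sets signal "similar" meaning are all hypothetical.

```python
def compare_annotations(codes_a, codes_b):
    """Classify two data elements by their semantic annotation codes.

    Each argument is an iterable of codes; several codes represent a
    postcoordinated annotation. Code order is ignored, which is itself an
    assumption about how postcoordination is interpreted.
    """
    a, b = frozenset(codes_a), frozenset(codes_b)
    if not a or not b:
        return "unannotated"  # at least one element has no code yet
    if a == b:
        return "same"         # identical annotation: candidate for merging
    if a & b:
        return "review"       # partial overlap: ask a domain expert
    return "different"        # no shared code


if __name__ == "__main__":
    # Same code in both systems: integration is straightforward.
    print(compare_annotations(["C0005890"], ["C0005890"]))  # same
    # Both codes denote height according to the text, yet a plain comparison
    # reports "different". This is why uniform coding matters.
    print(compare_annotations(["C0005890"], ["C0489786"]))  # different
```

The second call shows the limitation of purely code-based matching: without uniform coding, elements with the same meaning are not recognised as such.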

The context of a data element needs to be taken into account for appropriate semantic annotation. For example, an item “complication” can have a very different meaning in a controlled trial than in an EHR system. ODMedit provides access to the complete documentation form for each annotation code, thereby enabling manual review of context.

Related work

The proposed annotation tool ODMedit is based upon the CDISC ODM standard. Several international resources are available for data elements with semantic annotations, including the cancer Data Standards Registry and Repository (caDSR) from NCI [28], the OpenEHR Clinical Knowledge Manager [29] and the Clinical Information Modeling Initiative (CIMI) [30]. However, these resources are currently not based on the ODM standard, which is recommended by regulatory authorities for metadata and data transfer in clinical research [12].
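
To make the role of the ODM standard more concrete, the following sketch reads semantic annotations from ODM metadata. It assumes that annotations are stored as `Alias` elements (with `Context` and `Name` attributes, as defined in ODM 1.3) inside each `ItemDef`, and that the `Context` value starts with "UMLS"; the exact convention used by ODMedit is not stated here. The sample document is a deliberately minimal fragment, not a schema-valid ODM file, and `item_annotations` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of CDISC ODM 1.3.
ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Minimal illustrative fragment; required ODM attributes and elements are
# omitted. The Context value "UMLS CUI" is an assumed convention.
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="S.1">
    <MetaDataVersion OID="MDV.1" Name="Example">
      <ItemDef OID="I.1" Name="Height" DataType="float">
        <Alias Context="UMLS CUI" Name="C0005890"/>
      </ItemDef>
      <ItemDef OID="I.2" Name="Comment" DataType="text"/>
    </MetaDataVersion>
  </Study>
</ODM>"""


def item_annotations(odm_xml):
    """Map each ItemDef name to the list of UMLS codes found in its aliases."""
    root = ET.fromstring(odm_xml)
    result = {}
    for item in root.iterfind(".//odm:ItemDef", ODM_NS):
        codes = [
            alias.get("Name")
            for alias in item.iterfind("odm:Alias", ODM_NS)
            if alias.get("Context", "").startswith("UMLS")
        ]
        result[item.get("Name")] = codes
    return result


if __name__ == "__main__":
    print(item_annotations(SAMPLE))
    # {'Height': ['C0005890'], 'Comment': []}
```

Items without a code, such as "Comment" in the sample, are returned with an empty list, which makes data elements that still need annotation easy to spot.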

Mapping clinical data elements to controlled terminologies has been described in the literature. For instance, within the eMERGE network 157 data elements from 5 sites were mapped to caDSR, CDISC SDTM, NCI-T and SNOMED CT using a dedicated toolkit called eleMAP [31]. Another approach for mapping local data elements to standard vocabularies was proposed by German [32]. It is based on full-text search in a dedicated ontology (like LOINC for laboratory values) combined with review by a local terminology expert. This publication clearly identifies the need for uniform semantic annotation: “The largest barrier to linking knowledge-based medical decision support systems to heterogeneous DBs is the variety of ways in which similar data are represented”. Our approach is based on a public metadata repository of annotated data elements. These annotations are re-used, and an annotation code from UMLS is identified only for new data elements. Uniform annotation is thereby supported. Mapping eligibility criteria of clinical trials to semantic annotations is a complicated task, because many of these criteria are complex and sometimes ambiguous. In addition to NLP techniques, manual processing is needed in many cases [33].

ODMedit works on metadata only, not on data. For this reason it differs from most semantic web approaches, which address a web of linked data [34]. Approaches for mapping ontologies to clinical databases have been described previously, for example ONTOFUSION [35]. ONTOFUSION applies automated unification of semantic codes. In contrast, ODMedit uses expert-based coding based upon a large-scale metadata repository of medical data models. According to our experience with 1,000 models, available medical terminologies are not yet at a development stage that allows fully automated and reliable semantic coding.

Many powerful tools to support patient data integration are already available, such as Internet technology to share metadata and connect information systems, metadata standards like CDISC ODM as well as sophisticated medical terminologies like SNOMED CT or UMLS to annotate data items. ODMedit demonstrates that uniform semantic annotation of patient data is a challenging task, but feasible in principle. This annotation can facilitate data integration [24]. However, adoption by the scientific community is needed to make an impact. As a first step, ODMedit is connected to the largest public portal of medical data models. ODMedit is limited to structural metadata. Therefore data-related aspects of data integration, for example data completeness, are not addressed by this tool.

Much more awareness is needed regarding the benefits of open metadata in medicine and beyond to overcome the currently existing silos of complex, non-standardised systems. Patient data is sensitive and needs to be de-identified appropriately before sharing – but structural metadata should be open and semantically annotated for the scientific community and all citizens.