SciData: a data model and ontology for semantic representation of scientific data

For almost 40 years, scientists have been storing scientific data on computers. With the advent of the Internet, research data could be shared between scientists, first via email and later using web pages, FTP sites, and online databases. With the advancement of Internet technologies and online and local storage capabilities, the options for collecting and stored scientific information have become unlimited.

Yet, with all these advancements science faces an increasingly important issue of interoperability. Data are commonly stored in different formats, organized in different ways, and available via different tools/services severely impacting curation [2]. In addition, data is often without context (no metadata describing it), and if there is metadata it is minimal and often not based on standards. Though the Internet has promoted the creation of open standards in many areas, scientific data has, in a sense, been left behind because of its inherent complexity. The strange part about this scenario is that scientific data itself is not the biggest problem. The problem is the contextualization of the scientific data—the metadata that describes system that it applies to, the way it was investigated, the scientists that determined it, and the quality of the measurements.

So, what is scientific data and where is the metadata? Peter Murray-Rust grappled with these questions in 2010 and concluded that it is “factual data that shows up in research papers” [3]. When writing scientific articles, researchers add most (in most cases not all) of the valuable metadata in the description of the research they have performed. The motivation of course is open sharing of knowledge for the advancement of science, with appropriate attribution and provenance of research work. As we move toward the fourth paradigm [4], where large aggregations of data are the key to discovery, it is imperative that the context of the data are articulated completely (or as completely as possible), not only to identify it’s origin and authenticity, but more importantly to allow the data to be located correctly on the “scientific data map”.

To address these issue, this paper describes a generic scientific data model (SDM)/framework for scientific data derived from (1) the common structure of scientific articles, (2) the needs of electronic notebooks to capture scientific research data and metadata, and (3) the clear need to organize scientific data and its contextual descriptors (metadata). The SDM is intended to be data format/software agnostic and extremely flexible, so that it can be implemented as the scientific research dictates. While the SDM is abstract in nature, it defines a concrete framework that can be easily implemented in any database and does not constrain the data and metadata that can be stored. It therefore serves as a backbone upon which data and its associated metadata can be ‘attached’.

In addition, this paper describes an ontology that defines the terms in the SDM, which can be used to semantically annotate the structure of the data reported. In this way, scientific data can be integrated together by storage in Resource Description Framework (RDF) [5] triple stores and searched using SPARQL Protocol and RDF Query Language (SPARQL) queries [6].

The use of the ontology in the generation of RDF is demonstrated in examples of scientific data saved in JavaScript Object Notation (JSON) for Linked Data (JSON-LD) [7] format using the framework described by the SDM. From these examples it is shown how useful a hybrid structured (relational)/graph (unstructured) approach is to the representation of scientific data.

JSON-LD is a recent solution to allow transfer of any type of data via the web’s architecture—Representational State Transfer (REST) [8]—using a simple text-based format—JSON. [9] JSON-LD allows data to be transmitted with meaning, that is, the “@context” section of a JSON-LD document is used to provide aliases to the names of data reported and link them to ontological definitions using a Uniform Resource Identifier (URI)—often a Uniform Resource Locator (URL). In addition, the structure/data of the JSON-LD file can be automatically be serialized to Resource Description Format (RDF) using a JSON-LD processor, e.g. the JSON-LD Playground [10]. This capability makes JSON-LD files not only useful as a data format but also a compact representation of the meaning of the data.