The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation

Clarifying and representing habitats

Interest in a given environmental system and the processes which change it is often driven by the desire to understand the ecology of the organisms that inhabit it. The relationship between populations of organisms and the one or more environmental systems needed to sustain their existence and growth is the foundation for the semantics of “habitat”. ENVO’s previous representation of habitats was underdeveloped, and many of its classes confounded the semantics of “environment” and “habitat”, primarily due to the loose usage of these terms across disciplines. Thus, as anticipated by Buttigieg et al. [1], ENVO’s semantically confounded habitat [ENVO_00002036] class has been made obsolete and replaced by the equivalently labelled habitat [ENVO_01000739]. ENVO’s current habitat class represents an environmental system within which an ecological population (i.e. population [PCO_0000001]) can persist and grow. Importantly, a population of a given species (or similar grouping) need not be present in such an environmental system in order for that system to qualify as that species’ habitat: the environmental system need only have the disposition to support such a population.

Typically, subclasses of the current habitat will be formulated similarly to ‘Equus zebra habitat’, in that they will always reference some species or other grouping of organisms with similar physiological tolerances and environmental preferences. Habitats can be related to their constituent environments using the overlaps [RO_0002131] relation, as any given habitat will share parts with a range of environment types, according to the requirements of the species of interest. Organisms and populations of organisms can be associated with their habitats by the ‘has habitat’ [RO_0002303] relation, the definition of which was updated as a result of ENVO’s clarified representation of habitat. See Fig. 3 for illustration of these semantics.
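The pattern described above can be sketched as a small data structure. This is a minimal illustration, not ENVO's actual axiomatisation: the identifiers for habitat, overlaps, and 'has habitat' are taken from the text, whereas the `HabitatClass` type, the `subClassOf` string, and all `EX:` identifiers are hypothetical placeholders introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Identifiers quoted in the text.
HABITAT = "ENVO_01000739"     # habitat
OVERLAPS = "RO_0002131"       # overlaps
HAS_HABITAT = "RO_0002303"    # has habitat


@dataclass
class HabitatClass:
    """A taxon-specific habitat class (hypothetical helper type)."""
    class_id: str
    label: str
    taxon_id: str
    overlapped_environments: List[str] = field(default_factory=list)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Express the class as simple subject-predicate-object triples."""
        out = [
            (self.class_id, "subClassOf", HABITAT),
            (self.taxon_id, HAS_HABITAT, self.class_id),
        ]
        out += [(self.class_id, OVERLAPS, env)
                for env in self.overlapped_environments]
        return out


# All EX: identifiers below are placeholders, not real ontology IDs.
zebra_habitat = HabitatClass(
    class_id="EX:equus_zebra_habitat",
    label="Equus zebra habitat",
    taxon_id="EX:equus_zebra",
    overlapped_environments=["EX:environment_a", "EX:environment_b"],
)

for triple in zebra_habitat.triples():
    print(triple)
```

Each overlapped environment contributes one overlaps statement, so the habitat is described by the set of environment types it shares parts with rather than being equated with any single one of them.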

Fig. 3

Typical formulation of an experimental habitat class in ENVO’s automatically generated habitat content. Here, the environmental system which can support the survival and growth of a population of Equus zebra is represented. Based on the Encyclopedia of Life’s descriptions of the environments in which such populations are typically found, their habitat is represented as an environmental system which overlaps these environments. The boundaries of the habitat are determined by the physiological tolerances of the organisms: the habitat ends where the potential for an organism (or mating pair of organisms) to survive and/or increase their population size is no more. See text for comments on handling different definitions of ‘habitat’. Green borders indicate ENVO classes, while orange borders represent PCO classes.

As most collections of organisms grouped at species (and even sub-species) level can be associated with a distinct habitat, the number of classes in this branch is likely to become very large and automated approaches are required to make reasonable progress. Thus, we created an experimental branch of ENVO based on the results of the ENVIRONMENTS-EOL project [31], which text-mined the habitat descriptions of the Encyclopedia of Life [39] and associated them with ENVO classes. This approach generated results for 227,583 taxa, associating them with 1,605,974 automatically generated annotations (“tags”) based on ENVO class labels and synonyms. We reduced this collection to 112,585 taxa by removing taxa which we were unable to link to a National Center for Biotechnology Information (NCBI) taxonomy entry via the EOL API. This filtering was performed to focus on taxa that we could readily map to a widely-used taxonomy which is integrated with genomic data. We acknowledge that other taxonomies and/or phylogenies may be more accurate, both globally and for specific taxa: initiatives such as the Tree of Life Web Project [40], PhylomeDB [41], and TreeBase [42, 43] are of great interest in enhancing this dimension of our habitat hierarchy, and we will work towards integrating additional taxonomic resources in future releases. We then chose to focus our attention on taxa which face threats to their persistence by retaining only those taxa which feature in the International Union for Conservation of Nature (IUCN) Red List [44] as extinct in the wild (EW), regionally extinct (RE), critically endangered (CR), or endangered (EN), yielding 5,117 taxa. Due to their experimental nature, the results of this exercise are stored in a separate repository [45] and classes exist in a temporary ID range (prefixed with “ENVO:H”). The complete result set may be retrieved from [46].
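The two filtering steps described above (retain taxa with an NCBI taxonomy mapping, then retain those in the selected IUCN Red List categories) can be sketched as follows. This is an illustrative outline only and not the pipeline actually used: the record layout, the field names `ncbi_id` and `iucn_status`, and the toy records are assumptions made for the example.

```python
from typing import Dict, List, Optional

# IUCN Red List categories retained in the text: extinct in the wild,
# regionally extinct, critically endangered, endangered.
RETAINED_CATEGORIES = {"EW", "RE", "CR", "EN"}

Taxon = Dict[str, Optional[str]]


def filter_taxa(taxa: List[Taxon]) -> List[Taxon]:
    """Keep taxa with an NCBI mapping and a retained IUCN category."""
    # Step 1: drop taxa that could not be linked to an NCBI taxonomy entry.
    with_ncbi = [t for t in taxa if t.get("ncbi_id")]
    # Step 2: keep only taxa in the retained Red List categories.
    return [t for t in with_ncbi
            if t.get("iucn_status") in RETAINED_CATEGORIES]


# Toy records with hypothetical names and identifiers.
example_taxa = [
    {"name": "taxon A", "ncbi_id": "EX:1", "iucn_status": "EN"},
    {"name": "taxon B", "ncbi_id": None, "iucn_status": "CR"},
    {"name": "taxon C", "ncbi_id": "EX:3", "iucn_status": "LC"},
    {"name": "taxon D", "ncbi_id": "EX:4", "iucn_status": "EW"},
]

kept = filter_taxa(example_taxa)
print([t["name"] for t in kept])
```

In this toy input, taxon B is dropped at the first step for lacking an NCBI mapping, and taxon C at the second because its category is not among those retained.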

Our automatic mapping provides a foundation upon which high-quality semantic resources can be created for linking organisms to the environments which sustain their populations. However, this automatic mapping is prone to error and must be refined. Erroneous mappings have been identified due to simple false positives, ambiguous class labels, and text-mining routines which only account for the basic structure of the ontology. False positives can easily arise from the parsing of place names such as “Mountain River” or from other largely unpredictable facets of natural language. An example of the latter two issues was apparent in the overly narrow association of class label ‘pelagic zone’ [ENVO_00000208] to marine ecosystems. Large lakes are also said to have pelagic zones; however, workers in both marine and lacustrine domains will generally omit the qualifiers found in labels such as “marine pelagic zone”. Within ENVO, we decided to err on the side of caution and employ such modifiers, maintaining “pelagic zone” as a broad synonym associated with each class. Enhanced text-mining techniques, such as natural language processing (NLP), statistical analysis of text-mining results, and additional filtering based on a term’s ontological context, could further reduce false positives. We have yet to explore the feasibility of this solution with a rapidly developing ontology like ENVO, but we are encouraged by the promise of semi-automated ontology growth in the environmental domain.
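The two failure modes discussed above can be reproduced with a deliberately naive label matcher. The sketch below is not the ENVIRONMENTS-EOL tagger: the lexicon, the `EX:` identifiers, and the function names are hypothetical, and the matching is plain case-insensitive string search.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lexicon mapping labels and broad synonyms to candidate
# classes. The EX: identifiers are placeholders, not ENVO IDs.
LEXICON: Dict[str, List[str]] = {
    "mountain": ["EX:mountain"],
    "river": ["EX:river"],
    "pelagic zone": ["EX:marine_pelagic_zone", "EX:lacustrine_pelagic_zone"],
}


def naive_tags(text: str) -> List[Tuple[str, List[str]]]:
    """Return (term, candidate classes) for each lexicon term in the text."""
    lowered = text.lower()
    hits = []
    for term, candidates in LEXICON.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((term, candidates))
    return hits


def ambiguous_terms(hits: List[Tuple[str, List[str]]]) -> List[str]:
    """Terms that map to more than one class and need further context."""
    return [term for term, candidates in hits if len(candidates) > 1]


# A place name triggers two spurious environment tags.
place_hits = naive_tags("Specimens were collected near Mountain River.")
print(place_hits)

# A broad synonym matches, but the intended class is undetermined.
lake_hits = naive_tags("Found in the pelagic zone of large lakes.")
print(ambiguous_terms(lake_hits))
```

Flagging terms with several candidate classes, rather than silently choosing one, leaves room for later filtering by ontological context or for manual curation.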

While ENVO’s preliminary habitat representation shows promise, we stress that refinement and curation are needed before habitat classes will be added to the release version of the ontology. We will solicit input from experts on particular species and their environmental preferences in order to validate our mapping, report poor representations, and request enhancements via the ENVO issue tracker [25]. Building on these initial results, we aim to enable semantically controlled, large-scale habitat analyses driven by text-mining as described by authors such as Groom [47]. Eventually, we anticipate that coupling habitat semantics with distributional (e.g. [48]), trait (e.g. [19]), or behavioural data will offer further opportunities in predicting multi-scale patterns of biodiversity.

Importantly, it must be acknowledged that there will be some ambiguity in what constitutes an ecological population and, hence, what environments can provide a habitat for its members. Further, definitions of “habitat” also vary (see, e.g. [49, 50]), increasing the need for structured representation of the semantics behind the entity. These issues are further complicated by the decoupling of phylogeny from function due to, for example, horizontal gene transfer in microbes as well as procedural issues in stably identifying units of diversity [51–53] along with the role of microdiversity [54, 55]. Definitional variation and ambiguous boundary conditions are not unusual in the representation of environmental entities. ENVO will remain agnostic regarding any definition’s ‘correctness’ and we anticipate that co-existing and semantically overlapping habitat classes will emerge to represent the entities referenced by different communities. Addressing this challenge will be greatly helped by ENVO’s increased semantic flexibility, described in the sections above, which will be leveraged to tease apart this space. Through this process of representation, we hope that ENVO will serve as a hub for healthy and structured debate over central ecological entities such as habitats and niches, while simultaneously providing a resource to mobilise data in transparent ways. As a final but important note, we frequently encountered information indicating the typically deleterious impact of human activity on habitats. This, along with the need to provide semantics for defining anthropised environments, has motivated updates in ENVO’s representation of human-centric environmental systems and processes, as we describe below.