Machines first, humans second: on the importance of algorithmic interpretation of open chemistry data


Public databases

A number of public databases are currently available, but there are three prominent
examples which represent important data quality approaches: PubChem, ChemSpider and
ChEBI [27]. PubChem is the largest of the three, and like ChemSpider, it provides a method for
user-driven release of chemical structures and associated data. ChEBI, on the other
hand, makes use of a careful data curation process, which greatly increases the odds
of an individual record being accurate, but it also keeps the data size much smaller,
and passes on a significant expense to the data maintainer. PubChem and ChemSpider
both aspire to be comprehensive databases, but differ in their quality strategies:
PubChem relies primarily on upstream data correction, while ChemSpider is novel in
that it has incorporated a significant post-upload “crowd curation” component. While
both of these strategies are sound in principle, and have worked well in many other
fields, they have met with only modest success in keeping quality high for
millions of chemical compounds. In the case of PubChem, the problem is that many data
depositors provide content for which they have no immediate responsibility, i.e. it
had already been laundered through many error-prone curation techniques and lost all
provenance, and the depositor has no motivation to correct problems. Figure 1 shows a portion of the PubChem search results for aspirin and cholesterol,
in which each vendor-supplied link to a product page that was found to be unavailable
is marked in red. It is not known whether a broken link indicates that the submitting vendor is no longer in
business, is no longer selling the product, or has changed the identifier without
submitting a new one. It should be noted that this kind of “link decay” is a general
issue for all data aggregators.

Figure 1. Vendor links for PubChem records of aspirin (a) and cholesterol (b). Links shown in red were unavailable or broken when verified (October 2014).

In the case of ChemSpider, many tens of thousands of erroneous entries contained within
large depositions have been edited out by pre-filtering routines using desktop software
[28], and there are many examples where known molecules have been fixed by users of the
service, but this approach runs into issues with scale: there are tens of millions
of records, and the end users motivated to report problems number only
in the low hundreds [29]. More troublesome, though, is that the crowd curation approach is generally only
effective for spotting errors that are either obvious or found in popular records. For example, a search
for deuterated ammonium bromide in ChemSpider reveals a problematic result. Figure 2(a) shows the result as found using the ChemSpider Mobile app [30], which acquires the structure via the public API [31]: the structure representation that is delivered lacks the isotope information for
the 4 deuterium atoms. Figure 2(b) shows the same result using the web browser, which renders a structure that correctly
identifies the isotopes, but omits the negative charge on the bromine atom. Given
the context information, most importantly the name and list of synonyms, it is quite
obvious to a chemist that it is the structure that is wrong, and a solution can easily
be proposed. This is an example of where crowd-sourced quality improvement works quite
well, because no contextual expertise is required to identify both the mistake and
the correct answer, and indeed the data may well have been corrected by the time this
manuscript is published.

Figure 2. Two ChemSpider search results: deuterated ammonium bromide, (a) and (b), and aminophylline, (c) and (d). Examples (a) and (c) show the result as accessed by the ChemSpider Mobile app using the public API, while examples (b) and (d) show the web browser result page.

The second example, shown in Figure 2(c) and (d) for mobile and web results respectively, shows a structural representation
of a drug material: aminophylline. It is not immediately obvious that there is anything
wrong with the structure, given that the active ingredient is drawn correctly, and
the adduct is present. However, the synonyms that have been imported for the record
are quite explicit about the adduct ratio being 2:1, which makes the structure inconsistent
with the primary name and all of the synonyms. It is not necessarily clear to most
chemists whether the structure should be modified, or the name, or whether the distinction
is important. In this case a resolution is quite likely to be obtained, because the
compound in question is a well studied drug, and there is a fair chance that an expert
with specific knowledge will encounter the datum and be able to provide an authoritative
correction. However, there are tens of millions of compounds that are much more obscure,
and any of them could have an accidental extra methylene, an incorrect chiral centre,
or any number of other errors that still encode for a valid drawing. A structure
that is valid, but represents the wrong molecule, will never be corrected by an algorithm,
and will probably never be encountered by the handful of scientists who were involved
in its use or synthesis. In the large majority of such cases, the mechanism by which
the corrupted data was injected into the greater corpus of knowledge has created a
permanent stain.

Vendor catalogs

One of the major factors behind the increasing availability of chemical structure
data is the business value that is associated with a vendor of chemicals making their
wares as easy as possible to find. Making it simple for anyone to incorporate structures
and product identifiers into a generic chemical searching service is a clear value
proposition, and an additional encouraging feature of these data sources is that a
company that is responsible for selling a physical package with a particular chemical
compound has a high degree of liability, and is hence motivated to make a reasonable
effort to correctly represent the data.

Unfortunately these primary sources are diluted by providers of large scale high throughput
screening data, who prime their customers to expect a certain degree of noise:
the potentially poor state of the informatics component is just one of the many failure
modes for experiments that are designed to use quantity to compensate for lack of
quality. Perhaps more problematic is the large number of companies that collate
vendor catalogues from many other sources, losing much of the provenance along the
way, and introducing layers of errors that cannot be traced back to a single source.
Many of these repackaged vendor libraries have been submitted to public databases
such as PubChem and ChemSpider, and many of the companies are no longer contactable,
or have no business incentive to propagate error corrections.
It should be noted that the hosts of the ChemSpider database developed more stringent
acceptance criteria regarding vendor catalogues soon after the initial release of
the database [28]. These criteria, coupled with pre-filtering and with direct feedback to the vendors
themselves to encourage clean-up of their data, have resulted in improved data quality
not only in later depositions into ChemSpider but likely also for the community in
general, although this is an ongoing effort.

Open notebooks

From a data quality perspective, the most promising property of open lab notebooks
is their directness. The term refers to a specific kind of electronic lab notebook
that is made openly available to the scientific community shortly after the experiment
writeup is complete, circumventing the usual lengthy publication cycle and any proprietary
access restrictions. Typically a data unit, whether it be a reaction, a measurement,
or a characterisation analysis, is prepared and released directly by the scientist
who performed the experiment, and in some cases a second opinion is provided by a
principal investigator or reviewer (albeit with a much faster turnaround time than
for conventional publication, and also foregoing the requirement of novelty). This
means that not only is it possible to find out the individual and organisation responsible
for the contribution, but it also introduces the opportunity for the experimental
scientist, whose knowledge at that point exceeds that of anyone else on this specific
piece of data, to verify that the transmission of the data was carried out correctly.

A pertinent example is ChemSpider SyntheticPages (CSSP) [32], an online “micro-publishing” site serving chemists interested in chemical
syntheses. Chemists are encouraged to publish the details of their experiments in
order to communicate their work. CSSP uses a template-based entry form
and offers multimedia support, including interactive display of various types of analytical
data. The articles are reviewed by the CSSP editorial board, made up of university
professors, and are then subject to further peer review through public commentary
post-publication. The pre-publication peer review is generally very fast (24–48 hours),
and edits can be made even after publication, as CSSP is a hybrid publication-database.
Each micro-publication includes a digital object identifier (DOI), making the CSSP
contribution a citable object on a CV.

It is the scientist-to-Internet transmission step that we believe is in most need
of attention. Most chemists working to produce new knowledge in experimental laboratories
are not trained cheminformaticians, and have a strong tendency to follow currently
accepted best practices for documenting their results. At the present time, this typically
involves using off-the-shelf documentation software, such as the ubiquitous Microsoft
Word and Excel, and software such as ChemDraw or ChemDoodle that is specially designed
to help chemists create graphics for incorporation into such general purpose packages.
Unfortunately the use of these software tools all too often makes correct machine
interpretation of the data impossible: even in cases where data from drawing packages
is available, the reality is that these tools are designed for creation of diagrams,
not machine interpretable data, and there is no guidance as to which visual aids are
completely and unambiguously meaningful to an algorithm. There are documented standards
for visual representation [33],[34], but there has been little effort to implement these for the purpose of lossless
interconversion between presentation and informatics.

Much attention of late has been given to modern online collaborative tools, such as
Google Docs for coauthoring and sharing content, and electronic lab notebooks
(ELNs) with a blog-like interface [35]. While excellent for sharing data in real time, they do nothing to solve the problem
of machine interpretability of chemical data. Freeform text and uploading of arbitrary
supporting files give the maximum scope for scientists to describe their experiments,
but this is also the worst case scenario for creating a fully automated script to gather
diverse data into a single collection of relevant content in order to provide actionable
intelligence. Some progress in terms of checking data formats is being made by the
incorporation of chemistry-specific components into the Labtrove platform [36].

There are other approaches, such as disciplined use of spreadsheets for the purpose
of producing comma/tab-delimited text files, where structures are represented as SMILES
or InChI strings, or database identifiers. Consider the example shown in Figure 3, which shows several molecules from a series created as potential tuberculosis
inhibitors, all of which are based on a common scaffold template, shown in 3(a). The
structures used in the dataset, shown in 3(b), are drawn in a way that makes the common
structure evident to a chemist. Were these structures to be converted into a line
format, such as SMILES or InChI, in order to cram the data into a text file or spreadsheet,
the nuances of layout and orientation would be lost: the re-depicted structures, shown
in 3(d), are oriented randomly and do not reveal their commonality to the perceiving
chemist, at least not without some careful study. For certain purposes, the structures
in 3(b) and 3(d) are the same (e.g. looking up in a database or calculating properties),
but for purposes of communicating the information to chemists, they are quite different
when considered as a collection. Re-derivation of scaffolds and orientation can be
attempted after the fact, but it is far more advisable to avoid destroying the information
in the first place [37]-[39]. Also, changes in atom order, Kekulé/resonance form, treatment of implicit hydrogens,
standardisation of functional groups or tautomers, and failure cases when using molecular
bonding arrangements that are outside of the domain of the representation, mean that
what should have been a reversible operation often results in a structure that has
many more points of difference than just layout.

Figure 3. A selection of compounds (b) based on a common scaffold (a). Canonical SMILES strings are shown in (c), and their re-depicted structures shown in (d).

The pitfalls of using line notation for structures also apply when using database
references in lieu of structures, e.g. ChemSpider ID codes. This is an option when all of the entries are known
to already be in the database, but it is equally
destructive of data. A public database will typically have one globally preferred
structure representation which has been normalised and drawn in some preferred manner
which is not necessarily most appropriate for the task at hand. Using external identifiers
also introduces a slew of additional issues, e.g. if access to the Internet resource
is interrupted, the data becomes unusable until it is restored. Also, records can
be changed or deleted, and there is always the possibility that the database provider
may one day cease to provide the service at all.

Besides issues with structures, there are many other reasons to avoid using an overly
simplified text representation when properties are being included. For example, a
comma delimited text file might contain a column headed “IC50”, in
which each following row holds a number. It is pervasive practice to omit any further
information such as units, errors, target, sample size, conditions, etc., which means
that data stored in these files carries an enormous amount of implied context. If
the data is being shared between two scientists working on the same project this may
not be an issue, but if it is being uploaded into an aggregate dataset for purposes
of machine learning or database reference, it is worse than useless, due to its lack
of provenance and context.
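
The difference between implied and explicit context can be illustrated with a minimal sketch. The `<column>_<field>` header convention and the field names `units`, `target`, `error` and `n` are assumptions made purely for illustration, not an established schema, and the data values are invented placeholders.

```python
import csv
import io

# Hypothetical bare export: a header and numbers, with all context implied.
BARE_CSV = "SMILES,IC50\nCCO,12.5\nc1ccccc1,0.8\n"

# The same kind of data with context spelled out in extra columns.
# The "<column>_<field>" naming scheme is an illustrative assumption.
EXPLICIT_CSV = (
    "SMILES,IC50,IC50_units,IC50_target,IC50_error,IC50_n\n"
    "CCO,12.5,uM,hypothetical-target,0.4,3\n"
)

# Context fields that an aggregator would need before trusting the numbers.
REQUIRED_CONTEXT = ("units", "target", "error", "n")


def missing_context(csv_text, column):
    """Return the context fields that cannot be recovered for `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    return [f for f in REQUIRED_CONTEXT if f"{column}_{f}" not in headers]


print(missing_context(BARE_CSV, "IC50"))      # every field is missing
print(missing_context(EXPLICIT_CSV, "IC50"))  # nothing is missing
```

A check of this kind cannot recover the missing context, but it can at least refuse to merge a file into an aggregate dataset until the depositor has supplied it.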

Data formatting

When releasing open chemical data in a format intended to be parsed by both humans
and machines, there are many options for file format encodings, but few that meet
all the necessary requirements. Data that is being associated with one or more chemical
structures needs to begin by taking into consideration the ability of the selected
file format to preserve all of the relevant nuances of a chemical drawing. There are
many details to consider, but at the most basic level, there are two important questions
that should be asked:

1. Can an algorithm correctly and unambiguously determine the molecular formula?

2. Is it possible for software to use the representation to create a diagram that
reflects what the scientist originally drew?

There are surprisingly many chemical structure formats that are unable to guarantee
that an algorithm can determine the correct molecular formula from the drawing. These
shortcomings have mostly to do with implicit hydrogen atoms and inline abbreviations.
The implicit hydrogen problem is a side effect of chemists’ shorthand, which works well
in simple cases, but poorly defined valence rules for unusual bond types, and the
absence of a common method for overriding the default formula, mean that many nontrivial
molecules cannot be drawn using the most popular cheminformatics file formats, such
as MDL Molfile [40]. Abbreviations are also a persistent problem, since many structures are difficult
to represent in a human-readable way without abbreviating certain groups, but since
there is no universal repository for abbreviations, and many research groups invent
their own overlapping sets for localised use, it is necessary to have a way to define
these internally as part of the structure definition. Additional problems are introduced
when using drawing software designed for diagram creation, which offers a large variety
of drawing primitives that have no meaning at all (e.g. circles, symbols, free form
text, etc.). These file formats are a superset of the collection of meaningful objects,
and they cannot be used unless the operator has a strong understanding of which objects
are valid and which are not. This information is not generally known to experimentalists,
and not communicated by the drawing software.

The need to recreate a diagram with the original layout, orientation, wedge-bonds,
resonance patterns and various other nuances rules out the use of any popular line-based
formats that exclude atomic coordinates. There are many advocates of the use of SMILES
or InChI codes for raw structure representation [41], mainly because they are convenient for storing in spreadsheets or text files, and
have intrinsic canonical properties. Both of these features enable limited use of
chemical data by general purpose software that has no cheminformatics capabilities,
which is often a necessary evil for data manipulation. However, the amount of data
destruction involved in converting a 2D sketch into a short canonical string is highly
detrimental to data integrity. As long as chemically aware software is available,
there are no advantages to using canonical strings to represent structures, since
these can be derived on demand from the original representation. This creates a break-even-or-lose
scenario, and for this reason it should not be done unless there is no alternative.

One of the chemical structure formats that meets the objectives of data integrity
more effectively than most is that used by the SketchEl [42] open source chemical structure editor. The format is designed to capture 2D sketches,
but takes an extreme minimalist approach to its core data structure [43], which enumerates a list of atoms and a list of bonds, as indicated in Table 1. Notable features are the inclusion of the zero-order bond and the option to control
the number of virtual hydrogen atoms associated with each actual atom [40]. Using these simple additions, it is possible to describe any molecular species and
ensure that the molecular formula is correct, regardless of how exotic its bonding
arrangements are, and for the most part it is possible to devise a reasonable representation
of the atom valence states.
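
How an explicit hydrogen count and a zero-order bond keep the molecular formula well defined can be sketched in a few lines. This is not the SketchEl file format or its implementation: the `Atom` and `Bond` classes, the small default valence table and the choice of an amine-borane adduct as the example are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Deliberately tiny default valence table; real toolkits use richer rules.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1}


@dataclass
class Atom:
    element: str
    hydrogens: Optional[int] = None  # explicit override; None means "infer"


@dataclass
class Bond:
    a: int      # 0-based atom indices
    b: int
    order: int  # 0 = zero-order bond: connected, but uses up no valence


def hydrogen_count(index: int, atoms: List[Atom], bonds: List[Bond]) -> int:
    atom = atoms[index]
    if atom.hydrogens is not None:
        return atom.hydrogens  # the explicit value always wins
    valence = DEFAULT_VALENCE.get(atom.element)
    if valence is None:
        return 0  # no default rule for this element: assume no hydrogens
    used = sum(bond.order for bond in bonds if index in (bond.a, bond.b))
    return max(0, valence - used)


def molecular_formula(atoms: List[Atom], bonds: List[Bond]) -> str:
    counts = Counter()
    for i, atom in enumerate(atoms):
        counts[atom.element] += 1
        h = hydrogen_count(i, atoms, bonds)
        if h:
            counts["H"] += h
    # Hill order: C, then H, then the rest alphabetically (all alphabetical if no C).
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


# Ethanol: all hydrogens inferred from the default valences.
ethanol = molecular_formula([Atom("C"), Atom("C"), Atom("O")],
                            [Bond(0, 1, 1), Bond(1, 2, 1)])

# H3N-BH3 adduct: boron has no default rule, so its hydrogens are explicit,
# and the N-B link is zero-order so that nitrogen keeps its three hydrogens.
adduct = molecular_formula([Atom("N"), Atom("B", hydrogens=3)],
                           [Bond(0, 1, 0)])
print(ethanol, adduct)
```

If the N-B link in the second example were stored as an ordinary single bond, the simple counting rule would assign nitrogen only two hydrogens and the formula would be wrong, which is the kind of failure that the two additions are intended to prevent.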

Table 1. Atom and bond properties, and currently reserved extensions, used by the SketchEl molecule format

In spite of its minimalism, the format is also extensible in a way that is both forward-
and backward-compatible. A number of properties are optional. These optional properties, together with properties that
will be defined in the future and custom properties that are not a part of the formal
specification, are stored in a way that preserves the read/modify/write integrity
for algorithms that do not care to implement them. This is in contrast to formats
like MDL Molfile: if a software package writes a molecule that makes use of a property
that is not part of the lowest common denominator subset that most implementations
can handle, or defines its own extensions, the extra data will be either deleted or
corrupted if it is submitted to a software algorithm that does not implement the property
correctly.
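
The read/modify/write principle can be sketched as follows. The one-line `element key=value ...` syntax used here is a toy format invented for this example, not the SketchEl syntax; the point is only that fields the software does not recognise are carried through verbatim rather than dropped.

```python
# Keys this hypothetical implementation understands; everything else is opaque.
KNOWN_KEYS = {"charge"}


def parse_atom(line):
    """Parse a toy atom line, keeping unrecognised fields untouched."""
    element, *fields = line.split()
    known, extra = {}, []
    for field in fields:
        key, _, value = field.partition("=")
        if key in KNOWN_KEYS:
            known[key] = int(value)
        else:
            extra.append(field)  # stored verbatim, never interpreted
    return {"element": element, "known": known, "extra": extra}


def format_atom(atom):
    """Write the atom back out, re-emitting the unrecognised fields."""
    fields = [f"{k}={v}" for k, v in atom["known"].items()] + atom["extra"]
    return " ".join([atom["element"]] + fields)


atom = parse_atom("N charge=1 xFutureProperty=abc")
atom["known"]["charge"] = 0          # edit only what is understood
round_tripped = format_atom(atom)
print(round_tripped)
```

The edited line still carries `xFutureProperty=abc`, so a later program that does understand that field finds it intact.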

One of the optional extensions that can be implemented is the inclusion of inline
abbreviations. For most cheminformatics formats, if the user wishes to represent a
common abbreviation, e.g. the use of Ph to represent phenyl, the atom name is simply given the text “Ph” instead of a valid element. If all goes
well, the cheminformatics software that is interpreting the structure will have its
own special lookup list which recognises this shorthand notation. This approach is
haphazard at best, and the SketchEl format solves the problem by defining abbreviations inline. Figure 4 shows two approaches to representing bromobenzene: in (a) the 7 heavy atoms are drawn
out in full, while in (b) the phenyl (C6H5) fragment is written as a single node with a text abbreviation. If the abbreviation
atom is created in the same way as for an ordinary element symbol, saving the structure
as an MDL Molfile would be equivalent to the text shown in (c). This encodes
for a 2-node structure, of which just one is an element symbol, and so the chemical
meaning would be invalid to any parser that did not implement special meaning for
the “Ph” symbol. (It should be noted
that while the MDL Molfile specification does include a feature referred to
as “S-groups”, which can be used to achieve a similar effect to inline abbreviations,
this file format suffers from the fact that it has literally thousands of implementations
in current use across the industry, and that most of these implementations only make
use of a small lowest common denominator subset. Because the format is defined in
a way that makes the load/modify/save workflow destructive of unsupported properties,
any property that is outside of the commonly implemented subset is frequently at risk
of being deleted or corrupted. Making use of features such as S-groups substantially
reduces the number of software packages that are compatible with the data.) A well specified SketchEl file is shown in (d), which uses the inline abbreviation form to explicitly include
the fragment to which “Ph” corresponds, i.e. the phenyl ring itself, and the attachment
point. The sub-fragment definition contains all of the information necessary to display
or recreate the fragment, and hence represent the structure with all of its atoms,
and infer the correct overall molecular formula.

Figure 4. Bromobenzene, drawn in full (a) and with an abbreviation (b). The Molfile with the plain text abbreviation is shown in (c), while the SketchEl representation with the abbreviation encoded inline is shown
in (d).

The name of the abbreviation is defined in lieu of the element symbol, and the structural
fragment is defined completely within the extension field. Software that does not
implement the abbreviation mechanism can read, view, modify and write SketchEl molecules without destroying information, while abbreviation-aware software can provide
value added functionality, such as optionally showing the full structural definition
by request, or providing convenient ways to select from currently available abbreviations,
or subsuming existing structural fragments as new abbreviations or molecular formulae
(e.g. Ph could also be represented as C6H5). Abbreviations can also be nested, e.g. a fragment defined as Et3Si may expand out initially to a silicon atom with 3 ethyl “atoms”, which are themselves
abbreviations for C2H5. Figure 5 illustrates the levels of abbreviation: (a) is drawn out in full, with each heavy
atom represented as its own node, encoding for the molecular formula of C6H16Si; (b) collapses each of the ethyl groups into a single abbreviated composite fragment,
each node of which has the implied molecular formula C2H5; (c) represents the whole silyl fragment as an abbreviation-of-abbreviations, compiling
the 3 ethyl groups and the silicon atom from (b) into a single meta-abbreviation.
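
The nesting of abbreviations can be sketched with a small recursive structure in which every abbreviated node carries its own definition, so that no external lookup list is needed. The `Node` class and its fields are hypothetical names chosen for this illustration and do not reflect how SketchEl stores the data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    label: str                               # element symbol or abbreviation text
    hydrogens: int = 0                       # hydrogens attached to an element node
    fragment: Optional[List["Node"]] = None  # inline definition, if abbreviated


def element_counts(nodes: List[Node]) -> Counter:
    """Expand abbreviations recursively and count the atoms of each element."""
    total = Counter()
    for node in nodes:
        if node.fragment is not None:
            total.update(element_counts(node.fragment))  # the label is display text only
        else:
            total[node.label] += 1
            if node.hydrogens:
                total["H"] += node.hydrogens
    return total


def ethyl() -> Node:
    # "Et" defined inline as CH2-CH3
    return Node("Et", fragment=[Node("C", hydrogens=2), Node("C", hydrogens=3)])


# An abbreviation of abbreviations: one silicon atom plus three ethyl nodes.
silyl = Node("Et3Si", fragment=[Node("Si"), ethyl(), ethyl(), ethyl()])

# Triethylsilane written as the silyl node plus the remaining hydrogen.
counts = element_counts([silyl, Node("H")])
print(dict(counts))
```

The expansion gives 6 carbon, 16 hydrogen and 1 silicon atoms, in agreement with the formula C6H16Si quoted above for the fully drawn structure.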

Figure 5. Three different representations of triethylsilane, using different degrees of abbreviation.

While the SketchEl format cannot intrinsically represent many higher orders of metadata, e.g. mixtures
of compounds with different stereochemistry or constitutional isomers, these definitions
are difficult to pinpoint with a single proto-structure in a way that is not ad hoc:
a more rigorous approach is to define a higher layer of abstraction, which either
enumerates the different molecular species explicitly, or in complex but regular cases
such as Markush collections, defines its own enumeration formula.

Assembling collections of molecular structures brings a similar capability wish list:
a format should be minimalistic, simple to define, easy to implement, and most importantly
it should be forward and backward compatible, so any implementation of the specification
can read, view, modify and write the data with reasonable assurance that important
content will not be destroyed or corrupted. Since the most common use case scenario
for multiple structures and data involves representation in a tabular format, roughly
analogous to a spreadsheet with a molecular datatype, it makes sense to define the
core data format in this way, whereby more exotic data arrangements are mapped onto
the table with higher order metadata.

These characteristics are implemented by the datasheet XML format, which is also used
by the SketchEl package, for editing collections of structures and data. At its core it is a very
simple tabular format, where each column is strongly typed, and is one of molecule, string, integer, real number, or boolean. Molecules are embedded using the SketchEl molecule format. For many data collections, e.g. lists of molecular structures with
their associated identifiers (name, link, database ID, etc.) and properties (activity,
solubility, melting point, etc.), the core format is quite adequate.
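
A strongly typed table of this kind is straightforward to check mechanically. The sketch below is not the datasheet XML format itself: it assumes an in-memory list of `(name, type)` column pairs, treats `None` as a blank cell, and stands in for embedded molecules with placeholder strings.

```python
def cell_is_valid(type_name, value):
    """Check one cell against its column type (blank cells are allowed)."""
    if value is None:
        return True
    if type_name in ("molecule", "string"):
        return isinstance(value, str)  # molecules are held as text in this sketch
    if type_name == "boolean":
        return isinstance(value, bool)
    # bool is a subclass of int in Python, so exclude it from the numeric types
    if isinstance(value, bool):
        return False
    if type_name == "integer":
        return isinstance(value, int)
    if type_name == "real":
        return isinstance(value, (int, float))
    raise ValueError(f"unknown column type: {type_name}")


def invalid_cells(columns, rows):
    """Return (row index, column name) for every cell of the wrong type."""
    problems = []
    for r, row in enumerate(rows):
        for (name, type_name), value in zip(columns, row):
            if not cell_is_valid(type_name, value):
                problems.append((r, name))
    return problems


columns = [("Structure", "molecule"), ("Name", "string"),
           ("MeltingPoint", "real"), ("Active", "boolean")]
rows = [
    ["<molecule text>", "example A", 135.0, True],
    ["<molecule text>", "example B", "n/a", False],  # text in a numeric column
]
problems = invalid_cells(columns, rows)
print(problems)
```

The second row is flagged because free text such as "n/a" has been entered where a real number is required, which is exactly the kind of ambiguity that an untyped spreadsheet would silently accept.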

For additional labelling, or for higher order organisation such as arranging multiple
structures into reaction schemes, the datasheet header allows for extensions, which
contain arbitrary content that is generally not shown to the end user. Extensions
that follow a specific protocol are described as aspects. The principle of the aspect extension mechanism is that if a software application
implements the aspect, it should provide additional capabilities, such as alternate
viewing modes, specialised editors or additional classification information. An aspect
is required to be as tolerant as possible of disruptive external modifications, e.g.
if one of its necessary columns is deleted, it will recreate the column using default
values. If the software application does not recognise the aspect, it should still
be able to load the datasheet, present it to the user in its default tabular form,
allow cells to be edited, rows to be added or deleted or moved, and in some cases
modification of column names and types, without necessarily disrupting the higher
order markup. For example, if an aspect defines the default units for a given column,
loading the datasheet with an unaware editor and modifying the quantity values preserves
the read, view, modify, save integrity, as long as the user is aware of what the numbers
represent. If an aspect defines a chemical reaction, where a number of molecule columns
are used to define the various components, it is possible to use a minimal editor
to change some of the molecular structures, and still preserve the reaction definition.
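
The tolerance requirement can be sketched as a repair step that an aspect runs before it does anything else. The dictionary layout, the extension name `example.Aspect` and the required column list are all hypothetical and chosen only to illustrate the behaviour described above.

```python
# Columns that the hypothetical aspect needs: (name, type, default value).
REQUIRED_COLUMNS = [("Structure", "molecule", None), ("BoilingPoint", "real", None)]


def ensure_column(sheet, name, col_type, default):
    """Add the column with default values if it is missing; report whether it was added."""
    if any(col["name"] == name for col in sheet["columns"]):
        return False
    sheet["columns"].append({"name": name, "type": col_type})
    for row in sheet["rows"]:
        row.append(default)
    return True


def repair_for_aspect(sheet):
    """Recreate any required columns that an unaware editor has deleted."""
    return [name for name, col_type, default in REQUIRED_COLUMNS
            if ensure_column(sheet, name, col_type, default)]


# A sheet whose "BoilingPoint" column was deleted by an editor that does not
# know about the aspect; the extension block itself was left alone.
sheet = {
    "extensions": [{"name": "example.Aspect", "data": "opaque payload"}],
    "columns": [{"name": "Structure", "type": "molecule"},
                {"name": "Name", "type": "string"}],
    "rows": [["<molecule 1>", "ethanol"], ["<molecule 2>", "water"]],
}
recreated = repair_for_aspect(sheet)
print(recreated, sheet["rows"])
```

The repair is additive: existing columns, rows and the opaque extension payload are left as they were, and only the missing column is appended with blank values.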

Figure 6 shows several examples of datasheet aspects that are currently in use. Figure 6(a) demonstrates the Solvent aspect, rendered using the Green Lab Notebook app [44]. The aspect definition is quite minimal, and merely suggests more visually informative
ways to display several environmental and physical properties. Molecular datasheets
containing this aspect can be effectively viewed and edited with a tabular spreadsheet-like
editor that does not implement the aspect, which is shown in Figure 7. Figure 6(b) shows an example of the SARTable aspect, as rendered by the eponymous SAR Table app [45]. This aspect definition adds explicit definitions to distinguish columns of scaffolds and substituents, and the composite molecule that is formed by grafting them together, as well as additional information about
properties regarding their units and operating range. As with the Solvent aspect, an unaware molecular datasheet editor can be used to effectively view and
modify such data, as long as the operator takes some care to respect the column layout,
but will not benefit from some powerful editing capabilities, such as scaffold detection
or automatic recomposition of the molecule each time the fragments are changed.

Figure 6. Some of the datasheet aspects currently in use: (a) Solvent aspect (displayed by the Green Lab Notebook app), (b) SARTable aspect (displayed by the SAR Table app), (c) Reaction aspect (displayed by the Mobile Molecular DataSheet app) and (d) Experiment aspect (displayed by the Green Lab Notebook app).

Figure 7. Using the open source SketchEl editor to view and modify a datasheet that has an embedded Solvent aspect, which can be done safely and conveniently even though the SketchEl application does not implement the aspect.

Figure 6(c) and (d) both show renditions of aspects that encode for chemical reactions. The
Reaction aspect, shown in (c) rendered by the Mobile Molecular DataSheet app [46], combines a number of columns containing molecules, text and numbers into a single
reaction step, which is ideally rendered as a single graphical object representing
multiple components. The Experiment aspect, shown in (d) rendered by the Green Lab Notebook app, is a subclass of Reaction, which augments the definitions with additional information such as quantities and
component roles, and also allows for multiple steps. While datasheets containing the
Reaction or Experiment aspects can be viewed and edited by software that does not recognise the aspects,
the meaning of the content is less clear to the viewer, since a spreadsheet-like representation
is quite different to the reaction component layout algorithms used by the software
that implements these aspects.

Chemical data that is stored using the datasheet XML format, with embedded SketchEl molecules, conforms to a very well defined, easy to implement core specification
for data purity. Both molecules and datasheets can be extended as necessary to describe
more complex concepts, which are often necessary to ensure machine readability, but
their core definitions are highly functional, and generally safe to edit without specific
knowledge of higher order markup.

Data entry

One of the valuable properties of open data is the close connection between the scientist
and the content, which can be treated as an opportunity to solve some of the most
pernicious data quality issues in chemistry. When it comes to aggregated collections
of molecular structures, there are two main kinds of problems: structure representations
that are demonstrably wrong in the absolute sense, and those which could be correct,
but in the given context, the wrong chemical species is being described.

A significant amount of work has been invested in the former category, for example
the ChemSpider Validation and Standardization Platform (CVSP) [47] (also see: Karapetyan K, Williams AJ, Batchelor C, Sharpe D, Tkachenko V: The Chemical
Validation and Standardization Platform (CVSP). Large-scale automated validation of
chemical structure datasets, accepted for publication to Journal of Cheminformatics).
This tool embodies chemical knowledge, and can search for a number of common structure
mistakes, or representations that do not follow a protocol, such as covalently bound
salts (Na-Cl) or pentavalent nitro groups. Many of these examples are common and easy
to fix, but there are many more that cannot be corrected without knowledge
of context. Furthermore, there is no way to ascertain whether a molecule is the right
one when there are multiple conflicting reference points. Consider, for example, a data record that provides
a bioactivity measurement for a molecule named “aspirin”, for which the structure
given is salicylic acid. Even a smart algorithm that is able to find out that aspirin
is the acetylated form cannot know whether the data record provided the wrong structure
or the wrong name, unless the provenance is somehow recorded. Because it must be decided whether the molecule
or the name should be trusted preferentially and, if there are conflicts within either
of these, which source has precedence, each data collection needs a complex
and elaborate policy for judging data quality. These challenges have directly influenced
some of the approaches associated with the Open PHACTS semantic web project [48], where a “chemical lenses” approach has been utilised to focus the user in on various
forms of the chemical [49].

Proactive “fixing” of structures is capable of doing more harm than good. Consider
the ferrocene derivative shown in Figure 8: this compound has been imported into PubChem from several sources, one of them being
the NIST WebBook 50]. The structure representation used by the NIST source is shown in 8(a): it is an
admirable attempt to work within the constraints of the commonly implemented subset
of the MDL Molfile format, using only simple bonding types. While the rendition fails
to capture the aromaticity of the cyclopentadienyl ligands, or the oxidation state
of the metal, it successfully represents all of the points of connection between the
metal and the organometallic fragments, and does so in such a way that the valence
states of the carbon atoms add up to the correct number, which means that even the
most trivial implicit hydrogen counting algorithm will infer the correct molecular
formula. Consider, however, the post-processed structure that is incorporated into
PubChem, shown in Figure 8(b), which is obviously broken. The PubChem molecular structure representation has
a larger alphabet of bond types than that which is in common use by most software,
which can in principle be used to good effect, but in practice this is not necessarily
the case. As can be seen, the processed form adds 6 negative charges to a molecule
that was (and should be) overall neutral, and produces a seemingly arbitrary single/double
bond pattern within the 5-membered rings. While it may seem unfair to use an example
of a molecule that is difficult to represent using fundamental cheminformatics primitives,
the point is that using an automated structure correction algorithm is not a same-or-better
proposition: such algorithms can and do break structures that were originally correct.

Figure 8. (a) Original drawing of ferrocene carboxylic acid using a limited alphabet of bond
types (CASRN 1271-42-7); (b) modified structure after automated processing (PubChem
ID 11986122).

Nonetheless, tools for ensuring that a molecule is both valid and standardised
according to a set of rules are extremely valuable when incorporated into the editing
workflow, i.e. at the source of entry when user intervention is still an option, rather
than using automated scripts further downstream. Violations can be highlighted during
the editing of a molecular structure, and flagged again if the user attempts to submit
the entry to a database.

The adherence to hard rules for structure validity is often appropriate for processing
large databases with preexisting quality issues, but a heavy-handed approach is not
appropriate for original data entry. For example, when a chemist draws a pentavalent
carbon atom, it is usually a mistake, and software that can call attention to this
likely mistake as early as possible is beneficial. Nonetheless, there are real reasons
for representing such species (e.g. carboranes 51]), which raises an important point that is sometimes lost in software: the originating
scientist is always the final arbiter. There is usually at that time nobody in the world who knows more
about the particular unit of research being described, and the rule set designed for
a particular software package is far down the list of contenders.

Besides calling attention to what appear to be actual problems with structures, the
advice that an editing tool should be providing is real time feedback on how the structure
would be interpreted by an algorithm in its current state. The number of drawing conventions
in common use by chemists, which are enabled by many chemical drawing packages, but
have no broadly accepted cheminformatics interpretation, is enormous. Examples can
be found throughout all kinds of synthetic chemistry literature. Figure 9 shows two common examples with simple organic compounds: in (a) the chiral centers
are denoted using the (R) and (S) labels, which chemists often draw using plain text
labels when preparing manuscripts. These annotations are meaningless to machine algorithms,
and so the structure would be interpreted as having two unresolved chiral centers,
which is not what is being represented. In (b), a common shortcut for pictures is
used: the X and R abbreviations together encode for a total of 4 different chemical
species. Even if the cheminformatics software were able to ascertain that the non-element
labels are placeholders for fragments, the way in which the ortho/para isomerism is drawn makes the direct interpretation of the template a mixture of two
separate connected components – C6H5X and CH3R – which happen to have an overlapping bond. An algorithm would have to not only
use text mining to find out the composition of the fragments, but also further interpret
the label and other contextual clues in order to determine that the intended connectivity
is quite different to what is literally stated.

Figure 9. Two descriptions of organic compounds that are unlikely to be understood by cheminformatics
algorithms: (a) plain text annotation of chiral centers; (b) mixture of compounds
with varied connectivity.

While such underappreciated problems with organic compounds are prevalent throughout
the literature, the problems become far worse when venturing into inorganic and organometallic
chemistry. One of the archetypical examples of the failures of cheminformatics software
is the case of tin (II) chloride. Most chemists would expect to be able to draw the
structure shown in Figure 10(a), and to be able to rest assured that any software algorithm would understand that
this represents a species with the molecular formula of SnCl2. There is, however, a very high likelihood that an interpretation algorithm will
apply valence rules to calculate the number of implied hydrogens and, knowing that
tin is in group 14, treat it the same way as for silicon or carbon, which means that
the perceived structure is as shown in Figure 10(b): H2SnCl2. This is most likely incorrect, but the hydrogenated form is still a legitimate structure,
albeit difficult to handle in the laboratory. One might be tempted to solve the problem
by not automatically adding hydrogen atoms to tin, but this runs the risk of breaking
interoperability with other software. For example, if the user had drawn methyl groups
instead of chlorine atoms, the most probable structure that was being referred to
is (CH3)2SnH2, i.e. the organotin (IV) compound, for which implicit hydrogens should be added. This is a pertinent example, because for tin halides, the divalent forms
tend to be stable compounds, and the hydrogen-saturated form is generally highly reactive,
whereas this pattern is reversed for organotin analogs, for which carbene equivalents
are usually only observed as fleeting intermediates. This kind of chemical knowledge
that can often be assumed of human chemists is almost never taken into account by
cheminformatics algorithms, and nor should it be, given that the scope for complexity
is effectively infinite. As with all such ambiguities, the solution is to properly
define the rules, and include ways to explicitly indicate any exceptions. Figure 10(c) uses 3 different ways to solve the ambiguity for tin (II) chloride: indicating
that there are 2 valence electrons not used for bonding; specifying the oxidation
state; and explicitly marking the number of virtual hydrogens as being zero. Any one
of these methods will suffice in this example.
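
The interplay between a default valence rule and an explicit override can be sketched in a few lines of Python. This is a minimal illustration rather than the algorithm of any particular toolkit: the `Atom` class, its field names and the small `DEFAULT_VALENCE` table are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative default valences only; real toolkits use larger, differing tables.
DEFAULT_VALENCE = {"C": 4, "Si": 4, "Sn": 4, "N": 3, "O": 2, "Cl": 1}


@dataclass
class Atom:
    element: str
    bond_order_sum: int               # total order of bonds to drawn neighbours
    nonbonding: int = 0               # valence electrons marked as not used for bonding
    explicit_h: Optional[int] = None  # None: let the valence rule decide


def implicit_hydrogens(atom: Atom) -> int:
    """Number of hydrogens an algorithm would add to this atom."""
    if atom.explicit_h is not None:
        return atom.explicit_h        # an explicit count always wins
    valence = DEFAULT_VALENCE.get(atom.element)
    if valence is None:
        return 0                      # unknown element: add nothing
    return max(0, valence - atom.bond_order_sum - atom.nonbonding)


naive = Atom("Sn", bond_order_sum=2)                    # drawn as Cl-Sn-Cl
lone_pair = Atom("Sn", bond_order_sum=2, nonbonding=2)  # two electrons set aside
zero_h = Atom("Sn", bond_order_sum=2, explicit_h=0)     # hydrogens pinned to zero

print(implicit_hydrogens(naive))      # 2, i.e. read as H2SnCl2
print(implicit_hydrogens(lone_pair))  # 0, i.e. SnCl2
print(implicit_hydrogens(zero_h))     # 0, i.e. SnCl2
```

The rule itself stays simple, so a tin atom with two methyl neighbours still receives two hydrogens; the exception for the dichloride is carried by the data rather than by chemical knowledge built into the algorithm.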

Figure 10. Tin (II) chloride, (a) drawn naively; (b) interpreted incorrectly; (c) redundantly
over-specified.

Figure 11 shows a structure that is difficult to draw by any definition: an iron dimer of a
commonly used synthetic fragment. The drawing in 11(a) shows a chemist-friendly graphical
representation, which is suitable for publication in manuscripts 52]. As drawn with most chemical structure drawing tools, this has a number of problems.
The cyclopentadienyl ring is drawn with a ring, which is a graphical object that has
no meaning to cheminformatics, so rather than defining the fragment as C5H5 (with an extra electron), it is literally interpreted as cyclopentane, i.e. C5H10. The line emanating from the metal to the center of the ring is interpreted as a
methyl ligand, rather than an η5 multicentre attachment. The use of the wedge bond to denote a 3D effect does in this
case conflict with its use in cheminformatics to denote chirality, and may be stripped
out or used in some way to label a chiral center that is not real. The terminal carbonyl
ligands are encoded as “CO” and “OC” respectively, which are not elements, and so
are meaningless. All of these issues are quite unimportant for human-to-human communication,
since the mapping between these drawing conventions and the corresponding claim about
the chemical reality are well understood by organometallic chemists, but the absence
of a corresponding mapping from graphics-to-cheminformatics means that these diagrams
are unsuitable for use by machines. Figure 11(b) shows an alternative representation of the same structure, which is restricted
to use of a graph containing only atom nodes and bond edges. The use of zero-order
(dotted) bonds means that the atoms are all represented with a reasonable valence
count, all atom-to-atom connections have an explicit bond edge, and that the structure
distinguishes between dative and anionic ligand types. The atypical bonding of the terminal carbonyl ligands is represented
by explicitly indicating a carbene-like lone pair on the ligand atom, which is one
way to ensure that implicit hydrogens are not erroneously added.

Figure 11. Two representations of cyclopentadienyldicarbonyliron dimer: (a) diagram style preferred
by chemists; (b) a more fundamental representation that does not mislead cheminformatics
algorithms.

While the second structure diagram in Figure 11 is much less appealing for purposes of manuscript preparation, this illustrates the
primary argument of this article: imagine two algorithms, one designed to automatically
convert from the human-friendly format (a) to the machine-friendly format (b), and
the other to perform the opposite. If both algorithms have a comparably high but imperfect
success rate for a given domain (e.g. 99%), it is overwhelmingly preferable to use
the machine-friendly format for the primary repository, because of the asymmetry of
consequences. When a structure drawn for humans is parsed incorrectly into a machine
format and injected into a database, all too often the error goes unnoticed, and if
the provenance is not retained, then the corrupted data will surely find its way into
the body of scientific knowledge and continue on to befoul any and all data processing
operations that it comes into contact with. If on the other hand the data is represented
in a machine-friendly way, and algorithmically converted into a human-friendly graphical
format as needed, the consequences of failure are minor. For high quality uses such
as manuscript preparation, rare flaws will generally be noticed and can be corrected
easily enough, since literature publications are carefully scrutinised by several
reviewers prior to publication. Even if a sub-optimal drawing is published, as long
as it is correct, the fallout is likely to be manageable. For low quality uses, like
browsing search results from a database query, occasional representation of structures
in a way that is correct but not aesthetically ideal is a small nuisance, compared
to data corruption.

As well as such valence issues, a large category of data entry issues arises from the
use of text. As a general rule, any text in a structure diagram that does not map
to an element in the periodic table brings with it an additional burden for ensuring
that its meaning is strictly defined. Free text, e.g. a label that says “chiral” or
“cis/trans”, is clearly not applicable, but as mentioned earlier, abbreviations can
be dealt with by ensuring that they are defined within the chemical structure – though
not with free text labels such as “L = PPh3”. Other kinds of abbreviations, such as X, R, R1, R2, etc., serve as element placeholders
and their presence implies that the representation is of a template, rather than a
structure. It is important to ensure that these structures are never submitted to
a database without some accompanying formula that specifies how the template should
be converted into actual chemical species, but fortunately this is easy for data curation
algorithms to detect, and reject due to missing information.
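
A check of this kind might look like the following sketch. The function names and the shape of the `definitions` mapping are hypothetical, chosen only to illustrate the idea that any atom label outside the periodic table must come with a definition before a record is accepted.

```python
# Symbols of the 118 named elements; labels are compared case-sensitively,
# so "Co" (cobalt) is an element while "CO" is not.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split())


def undefined_labels(atom_labels, definitions):
    """Return non-element labels that have no accompanying definition.

    atom_labels: the label of every atom node in the structure.
    definitions: mapping from placeholder label to its allowed substituents.
    """
    missing = []
    for label in atom_labels:
        if label not in ELEMENTS and label not in definitions and label not in missing:
            missing.append(label)
    return missing


def accept_for_database(atom_labels, definitions=None):
    """Accept a record only if every atom label is an element or is defined."""
    return not undefined_labels(atom_labels, definitions or {})


template = ["C", "C", "C", "C", "C", "C", "X", "C", "R"]
print(undefined_labels(template, {}))  # ['X', 'R']
print(accept_for_database(template, {"X": ["Cl", "Br"], "R": ["H", "CH3"]}))  # True
```

The substituent lists in the last line are invented for the example; the point is only that a template without them is rejected, while the same template with them can be enumerated into actual species.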

One of the best visual aids for educating scientists about what their structure diagrams
actually mean to a machine algorithm is simply to display the computed molecular formula
for particular fragments. This is a concept that is deeply ingrained for all chemists,
regardless of their level of affinity for cheminformatics software. If the example
in Figure 11(a) is reported as having a formula of C14H28Fe2[CO][OC], the chemist does not need to be convinced that there is a problem: this is not the chemical composition of the molecule, which means that knowingly submitting this
representation to a database is tantamount to scientific fraud, and therefore something
must be done about it.

The most effective way to ensure that structures are represented accurately is to
use data entry tools that operate on a fundamental datastructure, such as the SketchEl molecule format, or an enhanced variant of the industry standard MDL Molfile 40]. Using graphical diagram drawing tools is problematic, because the functionality
they provide is a superset of what is valid for cheminformatics purposes, and there
are no algorithms that can transform an aesthetically styled structure into a machine
readable valid equivalent with a success rate that is acceptable. In principle, though,
it may be effective to create a plugin for such software to show the machine interpretation
and structural formula breakdown, updated in real time, in order to ensure that users
are aware of when their stylistic choices result in misleading content, but such tools
are not currently available. This represents a potential unmet need.

Case studies

Direct data hosting

There are many services that store user-provided chemical data in a fundamental cheminformatics
format, including the aforementioned ChemSpider and PubChem databases. These services
make use of elaborate content aggregation features, which involves a large amount
of automated correction. For most organic structures that conform to simple Lewis
octet rules this can be trusted to leave a well-drawn structure unmolested, but problems
arise when leaving this domain.

The molsync.com site 53] provides an example of a service that openly hosts chemical data, in its most pure
form, and allows it to be consumed in a variety of different downstream formats, for
either humans or machines. We describe some of the properties of this service, because
it differs from most ad hoc Internet sharing facilities in that it provides interpretation
and visualisation of raw chemical data. It demonstrates several key proof of concept
features that should be standard for chemical data hosting, and can be incorporated
into open lab notebook software.

Data can be uploaded using a simple REST-based API, after which point it is stored
in a database and assigned an identifier. The data is typically uploaded as either
a SketchEl molecule or datasheet XML document, but other related formats such as MDL Molfiles
will be automatically converted. From the identifier, a URL can be constructed, which
allows anyone with an Internet connection to open the page in a browser.

In browser mode, a standard HTML5 page is created. The web service fetches the underlying
raw data, and creates the tabular structure shown to the user, as shown in Figure 12(a).

Figure 12. Sharing chemical data using the molsync.com service, which stores the raw datasheet with any applicable aspects. The default view (a) is an HTML5 page, using resizable vector graphics, which can be downloaded in
a variety of informatics or customised graphics formats (b).

The outline of the browser presentation uses simple HTML and CSS. Individual molecular
structures, and some of the higher order metadata specified by aspects that are implemented
by the server – such as chemical reactions – are drawn using a high grade rendering
layout algorithm 54] and passed to the front end as vector drawing instructions, which means that the
page can be rendered to any resolution, and also can be sent to a printer or converted
into a PDF file without any loss of quality.

Data exporting

There are two main use cases for data conversion: migrating to a different cheminformatics
format to be consumed by a specific software application, and the generation of graphics
for presentation or publication purposes.

When raw data is stored in a rigorously minimalistic and unambiguous format, it is
generally effective to convert this data into the lowest common denominator subset
of a less rigorous format, with some potential for information loss, which may remain
theoretical for a reasonable domain of use cases. For example, converting a structure
into a V2000 MDL Molfile that is readable by the large majority of software that can
parse the format can be expected to preserve all of the pertinent information in many
molecule types. For nonorganic structures that cannot be properly represented with
the V2000 format, or for structures that use inline abbreviations, the conversion
cannot survive a round trip intact, and so the conversion is an irreversible downstream one. For information that is pertinent to the destination format, but does not exist
in the core specification of a SketchEl molecule, the extensibility mechanism holds the door open for future improvement,
in a backward and forward compatible way. For example, MDL Molfiles provide a number
of capabilities for specifying chemical queries 55] as atom and bond annotations. The SketchEl molecule format can optionally incorporate analogous extensions, and if the data
hosting service is subsequently upgraded so that it can convert the overlapping subset
of functionality to the MDL Molfile equivalent, then this capability can be introduced
at any time. The operation is commutative to the extent that the definitions match.
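
To make the idea of a lossy downstream export concrete, the sketch below writes a bare-bones V2000 Molfile from a list of atoms and bonds. It is not the converter used by any service described here. The input tuples are a hypothetical stand-in for a richer molecule object, and the decision to write an unsupported bond order as a single bond while reporting a warning is an assumption made for the example; a real exporter might equally refuse the conversion.

```python
def to_molfile_v2000(name, atoms, bonds):
    """Write a minimal V2000 Molfile; return (text, warnings).

    atoms: list of (element, x, y) tuples.
    bonds: list of (from_index, to_index, order) tuples, indices starting at 1.
    """
    warnings = []
    lines = [name, "", ""]  # header block: title, program/timestamp, comment
    # Counts line: atom count, bond count, unused fields, then the version tag.
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for element, x, y in atoms:
        # Coordinates, symbol, then twelve property fields left at their defaults.
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {element:<3s} 0" + "  0" * 11)
    for a, b, order in bonds:
        if order not in (1, 2, 3):
            # Assumption: downgrade and report, so the caller knows data was lost.
            warnings.append(f"bond {a}-{b}: order {order} written as single")
            order = 1
        lines.append(f"{a:3d}{b:3d}{order:3d}" + "  0" * 4)
    lines.append("M  END")
    return "\n".join(lines) + "\n", warnings


text, warnings = to_molfile_v2000(
    "iron-carbon contact", [("Fe", 0.0, 0.0), ("C", 1.5, 0.0)], [(1, 2, 0)]
)
print(warnings)  # ['bond 1-2: order 0 written as single']
```

A non-empty warning list is the signal that the export cannot survive a round trip, which is exactly the case in which the original record, not the Molfile, must remain the primary copy.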

Similarly with collections being exported as MDL SDfiles, a significant amount of
metadata is lost, particularly regarding the columns and types, and so it cannot always
be assumed that an upstream conversion will preserve all of the original data. Other destination formats have
more interesting caveats. For example, the Chemical Markup Language (CML) 56] is for all practical purposes a superset of all possible chemical formats, since
additional tags can be introduced by any writer without affecting validity, which
passes the interpretation problem down the line: there is no guarantee that other
software will understand the choice of properties, meaning that interoperability is
very low.

Converting a rigorous, minimalistic cheminformatics format into manuscript quality
graphics is not a simple task. Because high level aesthetic style information has
no place being stored in the core definition of a datastructure that is intended to
describe the chemistry in a way that is understandable to machines, it means that
the rendering process involves the creation of a lot of additional information, namely
the positioning for each of the labels, bonds and various other annotations 54]. While the loss of layout cues in the core datastructure is unfortunate in the case
of structures that were originally imported from a drawing program that allowed the
user to specify such preferences, it does mean that all structures are created equal
as far as visualisation is concerned, as long as the 2D coordinates and wedge bonds
for each of the non-virtual atoms are chosen to suit. Since many structures are partially
or completely composed using algorithms, rather than being hand drawn, it is highly
beneficial to be able to create high quality diagrams without additional user intervention.
One alternative to insisting on algorithmic recreation of aesthetic properties is
to store layout hints (e.g. atom colours, charge positions, etc.) as optional non-fundamental
extension properties.

As with cheminformatics formats, there are a number of graphics formats to choose
from, and the most appropriate of these varies depending on the destination. The most
universally recognised format is the Portable Network Graphics (PNG) format, which
is a bitmapped format. Until recently this was the only practical method for displaying
custom graphics on a web page, but it has major limitations, e.g. the resolution has
to be selected prior to generating the page, as well as a litany of other inconveniences.
All too often manuscripts created with word processing software incorporate bitmapped
graphics, and these need to be generated at a much higher resolution than what is
suitable for screen viewing. A document with screen-resolution bitmapped graphics appears
shoddy when zoomed to a non-default resolution, and frequently almost illegible when
printed or converted into a PDF file, which describes both of the primary use cases
for manuscript preparation. Since molecular structures are inherently vector diagrams,
being originally composed by the software using a small dictionary of shapes: lines,
circles, curves, etc., it is strongly preferable to represent the drawings in a vector
format, which ensures that they can be rendered as perfectly as the device allows,
whether it be a screen, a printer, or a print-ready file format like PDF. There are
a number of vector graphics formats to choose from, and these include Scalable Vector
Graphics (SVG), Encapsulated PostScript (EPS), and embedded graphics formats like
DrawingML, which can be used to compose vector diagrams inside Microsoft Word, Excel
or PowerPoint documents.
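
As a rough illustration of why a vector format is a natural fit, the following sketch turns a list of atoms and bonds into an SVG string. It is deliberately simplistic and is not the rendering algorithm referred to above: every bond is drawn as a single line, and the function name, argument layout and styling values are assumptions made for the example.

```python
from xml.sax.saxutils import escape


def molecule_to_svg(atoms, bonds, scale=30.0, margin=20.0):
    """Render a molecular graph as a resolution-independent SVG string.

    atoms: list of (label, x, y) in arbitrary units, y pointing up.
    bonds: list of (from_index, to_index) pairs, indices starting at 0.
    """
    xs = [x for _, x, _ in atoms]
    ys = [y for _, _, y in atoms]
    width = (max(xs) - min(xs)) * scale + 2 * margin
    height = (max(ys) - min(ys)) * scale + 2 * margin

    def place(x, y):
        # SVG's y axis points down, so flip the chemistry coordinates.
        return (x - min(xs)) * scale + margin, (max(ys) - y) * scale + margin

    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'viewBox="0 0 {width:.1f} {height:.1f}">']
    for a, b in bonds:
        (x1, y1), (x2, y2) = place(*atoms[a][1:]), place(*atoms[b][1:])
        parts.append(f'<line x1="{x1:.1f}" y1="{y1:.1f}" '
                     f'x2="{x2:.1f}" y2="{y2:.1f}" stroke="black"/>')
    for label, x, y in atoms:
        if label != "C":  # carbon atoms are left implicit, as in skeletal formulae
            px, py = place(x, y)
            parts.append(f'<circle cx="{px:.1f}" cy="{py:.1f}" r="8" fill="white"/>')
            parts.append(f'<text x="{px:.1f}" y="{py:.1f}" text-anchor="middle" '
                         f'dominant-baseline="central" font-size="12">'
                         f'{escape(label)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


svg = molecule_to_svg([("C", 0, 0), ("C", 1, 0.5), ("OH", 2, 0)], [(0, 1), (1, 2)])
print(svg.splitlines()[0])
```

Because the output declares only a `viewBox` and no fixed pixel size, the same string can be scaled to a thumbnail or a poster without being regenerated, which is the property that a bitmap lacks.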

The reason for taking the approach of storing chemical data in the most rigorous cheminformatics
format, and converting it on demand, is that functionality can be provided as it is
needed. Because the data is stored in a format that is understandable to an algorithm
at a fundamental level, it can be converted into any format that the service is currently
capable of creating, and taking into account the needs of any aspects that have currently
been implemented by the service. Figure 12(b) shows the dialog that is presented when requesting the downloading of a datasheet
with an embedded Experiment aspect. The list of available formats includes several informatics formats, and a number
of different ways to render the content as graphics which can subsequently be used
by other presentation packages, including Microsoft Word format with embedded vector
diagrams. The combination of machine-readable raw data and a chemistry aware service
has two clear advantages over storing pre-prepared files in several formats: additional
output formats can be added at any time, and the user is also given the option to
customise the output, e.g. by selecting the resolution and colour scheme for molecular
graphics. This approach satisfies the needs of machines and humans.

Sharing

The Internet provides a seemingly limitless menu of ways to share information across
the globe, and most of them can be adapted to chemistry in some way, but other than
approaches such as that taken by molsync.com, these seldom have the ability to form a strong association between the machine interpretable
data and the human viewable rendering thereof. For example, a user can easily use
Twitter to share a graphical picture of a molecular structure, but since this is just
a bitmapped image, to a machine it is largely indistinguishable from a photograph
of a kitten. The data only regains its full value if an individual human redraws the
structure using a chemical drawing package (or attempts to parse it with a bitmap-to-structure
conversion tool).

On the other hand, if a user posts chemical data using molsync.com, or links to a verified entry in a database like ChemSpider or PubChem, or any other
link that can be resolved to a download that advertises a chemical MIME type 57] it may well be crawled by information extraction services. For example, the Open
Drug Discovery Teams (ODDT) project 58] continuously scours Twitter feeds looking for specific hashtags that relate to its
collection of topics, which mostly pertain to rare and neglected diseases, and other
precompetitive scientific topics (e.g. green chemistry and drug repurposing). Tweets
with links are harvested and added to a database, but those which have resolvable
chemical data are treated specially, as shown in Figure 13.
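
How a crawler might recognise such a link can be sketched as below. This is not a description of how ODDT works internally: the function name and the short lookup table are assumptions, and only three widely used chemical MIME types are listed.

```python
# A few commonly used chemical MIME types and the formats they denote.
CHEMICAL_MIME_TYPES = {
    "chemical/x-mdl-molfile": "MDL Molfile",
    "chemical/x-mdl-sdfile": "MDL SDfile",
    "chemical/x-cml": "Chemical Markup Language",
}


def chemical_format(content_type):
    """Map a Content-Type header value to a chemical format name, or None.

    Parameters such as "; charset=utf-8" are ignored and matching is
    case-insensitive. Unknown types under "chemical/" are reported generically.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in CHEMICAL_MIME_TYPES:
        return CHEMICAL_MIME_TYPES[media_type]
    if media_type.startswith("chemical/"):
        return "unrecognised chemical format"
    return None


print(chemical_format("chemical/x-mdl-molfile; charset=utf-8"))  # MDL Molfile
print(chemical_format("image/png"))                              # None
```

A link that resolves to an image yields nothing that can be processed further, whereas a link that advertises a chemical type can be downloaded and parsed as structure data.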

Figure 13. The Open Drug Discovery Teams app showing some of the covered topics (a) and a detail
view of some of the content obtained relating to the Ebola virus, in particular several
structures of FDA approved drugs (b).

Sharing of machine interpretable data is leveraged from within the ODDT app, and it
is easy to obtain it and incorporate it within a cheminformatics workflow. The data
is acquired in its pure state, and there is no need to reenter it, because no information
was lost during the transition. We have described how mobile technologies can be used
for secure sharing of data prior to open sharing in ODDT 59]. In addition we have shown how ODDT can be used to surface structure activity relationship
(SAR) data from behind paywalls 60] and raise awareness of specific topics 61],62]. Twitter is also a valuable tool for realtime microblogging from scientific conferences
63]: there are an increasing number of scientists who routinely “live tweet” what they
learn during conferences, and there is no reason why digitally accessible data cannot
be incorporated into this stream.

Another novel feature that the molsync.com service provides is the display of a molecular glyph, which is the equivalent of
a chemical QR code: its role is equivalent to a URL, except that when it is printed
out on paper, e.g. on a poster or a label, it is possible to use the Living Molecules
mobile app 64] to photograph the glyph. Once the payload is extracted, the app is able to go directly
to the source of the data, and download it in its pure form, i.e. it is now loaded
into the app itself, and from there it can be viewed, exported, re-shared or used
in any other way that raw cheminformatics data can. We have shown how this glyph could
be used practically to encode chemical ingredients in consumer products 65].