ChemEngine: harvesting 3D chemical structures of supplementary data from PDF files

Harvesting chemical data from the web is a challenging task that requires several convoluted steps. When chemical structures are stored in a truly computable format, with atoms and bond matrices (vector format, Cartesian co-ordinates), they can be processed electronically for computational and informatics purposes. However, when the files are transformed into or stored as PDF (Portable Document Format), which is generally used for the convenience of printing and reading, the valuable and re-usable molecular data is effectively lost: it remains buried in the scientific literature as documents and is seldom used for further computational studies. In earlier days, hand-drawn molecules in ORTEP diagram format were published when the 3D conformation of molecules was discussed in research articles. Generating 3D structures from these raster-format molecular images was extremely difficult. Recently, some efforts have been made to transform computer-generated and hand-drawn chemical images from journal articles and patent documents into truly computable molecules for inventory and database applications. Other similar endeavors include transforming either textual chemical names (common names, systematic names and corporate identifiers, for example CAS Registry Numbers) or computer-generated names into the corresponding molecular structures, with moderate success. Although name-to-structure conversion programs are now routinely used for harvesting chemical data from documents, they have been insufficient for generating accurate, truly computable and re-usable molecular data. Supporting information for research articles based on computational methods, such as those describing the transition states of organic reactions, is now available from journal publishers’ websites. It contains descriptions of the computations performed, tables of results and molecular images in 3D conformations, along with 3D molecular co-ordinates, all in PDF format.
Combining all of this data in a single file complicates both the harvesting process and the development of pattern recognition techniques that selectively exclude everything other than atomic co-ordinates from the large pool of text presented as supporting material. Since there are no defined rules or guidelines for submitting molecular data in a supporting document associated with a research publication, authors are free to choose their preferred way of representing molecular data, such as chemical structures and the corresponding atomic co-ordinates, in the supplementary data file. This freedom in the choice of data format necessitates several pattern recognition templates, in the form of regular expressions, that handle diverse formats (co-ordinates separated by spaces, commas, tabs, etc.) and preserve the order in which the authors present the XYZ co-ordinates and atom information. This study therefore highlights the need for standards for submitting supporting materials with molecular data, in a consistent, truly computable and re-usable format, to journals publishing computational research. A specific set of publisher-defined guidelines for submitting molecular data, even in PDF format, would accelerate the automatic processing and recognition of chemical data for further computational studies related to reaction modeling [1-3], drug discovery [4-7] and molecular inventory management [8, 9]. Several standard molecular representations in ASCII format, which are easily readable by molecular modeling and chemoinformatics software packages, are available. Supporting materials are nevertheless deposited in PDF format for the convenience of storage, easy manageability and electronic dissemination. Commercial software packages for computational chemistry employ their own legacy file formats for handling molecular data, the technical details of which are usually not published.
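The regular-expression templates described above could look something like the following minimal Python sketch. It is not ChemEngine's actual code: the names `PATTERNS`, `SYMBOLS` and `parse_coordinate_line`, and the choice of layouts (symbol or atomic number followed by x, y, z, and a six-column table row with centre number, atomic number and atomic type) are illustrative assumptions. The element table is cut off at xenon for brevity.

```python
import re

# Assumed delimiters: a comma (with optional blanks) or a run of spaces/tabs.
SEP = r"(?:\s*,\s*|\s+)"
# A decimal point is required so integer columns are not read as coordinates.
NUM = r"[-+]?\d+\.\d+"
XYZ = rf"(?P<x>{NUM}){SEP}(?P<y>{NUM}){SEP}(?P<z>{NUM})"

PATTERNS = [
    # Table row: centre number, atomic number, atomic type, x, y, z
    re.compile(rf"^\s*\d+{SEP}(?P<atom>\d+){SEP}\d+{SEP}{XYZ}\s*$"),
    # Atomic number or element symbol, then x, y, z
    re.compile(rf"^\s*(?P<atom>\d+|[A-Z][a-z]?){SEP}{XYZ}\s*$"),
]

# Element symbols for Z = 1..54 (illustrative subset of the periodic table).
SYMBOLS = ("H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar "
           "K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr "
           "Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe").split()


def parse_coordinate_line(line):
    """Return (symbol, x, y, z) if the line looks like an atom record, else None."""
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match is None:
            continue
        atom = match.group("atom")
        if atom.isdigit():
            number = int(atom)
            if not 1 <= number <= len(SYMBOLS):
                return None
            symbol = SYMBOLS[number - 1]
        elif atom in SYMBOLS:
            symbol = atom
        else:
            return None  # looks like a record but is not a known element
        return (symbol, float(match.group("x")),
                float(match.group("y")), float(match.group("z")))
    return None
```

Normalising every layout to the same `(symbol, x, y, z)` tuple keeps the later steps independent of how a given author formatted the table. A purely numeric results table with the same column shape would still be accepted by this sketch, so a real implementation would need additional context checks.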
From the researchers’ point of view, data published in re-usable formats would save the time and effort needed to understand the molecular data and would allow it to be used directly in further advanced studies, in different problem-solving environments, that require the 3D conformation of molecules. Exchanging chemical data between multiple software packages without loss of information is a critical requirement in computational chemistry and chemoinformatics applications. Thus there is a need for tools that can bridge this gap by translating molecular data automatically and accurately from PDF format to a truly computable, re-usable format without manual intervention.

In this context, it is pertinent to mention the efforts of Rzepa and Murray-Rust in developing tools to parse chemically relevant theses and other published articles for harvesting analytical data [10, 11]. Special emphasis was laid on the use of information technology (IT) techniques for the free redistribution of electronic chemical data, for instance by storing the actual supplementary information in structured XML/CML documents for universal applicability and dissemination of valuable experimental and computed data, thus advancing “data-led science”, as is the case in biology. The Blue Obelisk, an informal group initiative [12], encourages the use of open-source data, open standards, and shared algorithms and tools for chemoinformatics tasks. It has led to the development of valuable tools such as JChemPaint [13], CDK [14] and chemical information systems [15]. Similar efforts have been made by the Cambridge Crystallographic Data Centre (CCDC), which provides easily downloadable crystal structures of organic molecules that are compatible with a number of software solutions for drug discovery [16]. A recent article emphasized the importance of curating large chemogenomics data sets for building better predictive models for the life sciences [17]. During the preparation of this manuscript, a timely research article by Rzepa’s group on a granularity model for extracting molecular information appeared [18]; it stresses the need for periodic and automatic curation of data from supplementary information in research articles. The present work is geared towards partially fulfilling this need for “futuristic research data management”.

Conventionally, chemical names (common and systematic) and Chemical Abstracts Service (CAS) Registry Numbers are extracted from web pages and transformed into the corresponding molecular structures using name-to-structure conversion tools [19], name-to-structure relational database look-up methods [20], large-scale key-value pair lists [21], distributed relational database searches [22], etc. We have previously employed distributed systems to harvest chemical data from web pages using the Google API (ChemXtreme) [23]. Transforming raster images into vector graphics and then identifying the pixel information associated with the atoms and bonds of a molecule is a cumbersome job [24]. Tools such as ChemRobot [25], OSRA [26], ChemReader [27] and CLiDE [28] have also been developed to harvest molecular data from web-camera and scanned images. In these tools the raster graphics data is transformed into vector graphics in order to retrieve the atom and bond information needed to generate truly computable and re-usable chemical structures, but only limited success has been achieved. A foolproof method that reproducibly recovers computable molecules from images is still a distant dream, as the existing methodologies and tools do not provide accurate molecular data after processing. It is therefore essential to develop efficient tools that can extract molecules from rich sources such as the supplementary information files deposited at journal sites. Although spectral, molecular and analytical data have been harvested in the past, extracting molecules directly from author-supplied atomic coordinates provided as supplementary material in PDF format has not been reported. Accordingly, in the present work we have developed an application, ChemEngine, that reads all the files stored in PDF format, extracts the molecular coordinates and generates computable molecular structures.
To demonstrate the efficiency of the program, supporting material data files with three different molecular representations, differing in the delimiters used in the co-ordinate data, were selected, and ChemEngine successfully parsed them to extract the molecular data. Note that the first two files, from ACS publications, did not require permission for data harvesting, while in the third case (RSC Advances) an article published under the CC-BY license was selected. Automatic bulk processing of articles or supporting materials from publishers’ sites is usually prohibited by copyright and article access policies.

Generally, every computational chemistry software program provides an export format for the computed data, either as plain text or as delimited text, that can be analyzed, visualized and plotted with common tools such as Microsoft Excel or with molecular viewers that accept molecules as plain text in the simple .xyz format. However, supporting material files containing molecular data also include brief descriptions of molecules, computed data, plots, page numbers, document information, manuscript bibliographic details, etc. in a single PDF document, which makes harvesting the molecular data extremely difficult because all of these have to be selectively excluded while parsing the file. In Fig. 1, only the text enclosed in the rectangular box is recognized by ChemEngine's patterns; the rest of the unstructured text is ignored. Given an input file in PDF format, the program yields three files: one in GJF format, a text file containing the computed bond matrix, and a file containing all molecules in SDF format. The contents of the non-molecular data file can also be put to use by subjecting them to standard text mining methodologies [29, 30] to retrieve molecule names or other information, such as the list of basis sets employed in the specific computational work.
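As a rough illustration of the selective exclusion described above, the Python sketch below splits text already extracted from a PDF into runs of atom records and everything else. It is a simplified assumption about how such filtering might work, not ChemEngine's implementation: the single `ATOM_LINE` pattern, the `min_atoms` threshold, the function name `split_molecules`, and the use of the nearest preceding text line as a molecule title are all hypothetical.

```python
import re

# Assumed record layout: element symbol, then x, y, z separated by
# spaces, tabs or commas. Real files need more templates than this one.
ATOM_LINE = re.compile(
    r"^\s*([A-Z][a-z]?)[\s,]+([-+]?\d+\.\d+)"
    r"[\s,]+([-+]?\d+\.\d+)[\s,]+([-+]?\d+\.\d+)\s*$"
)


def split_molecules(text, min_atoms=2):
    """Separate coordinate blocks from the surrounding text.

    Returns (molecules, leftover). Each molecule is a dict holding a
    'title' (the last text line seen before the block) and its 'atoms'
    as (symbol, x, y, z) tuples. 'leftover' collects every other
    non-blank line, e.g. for later text mining.
    """
    molecules, leftover = [], []
    run = []      # consecutive (raw_line, atom) pairs
    title = ""    # last non-blank text line before the current run

    def flush(pairs, label):
        # Keep a run only if it is long enough to be a molecule.
        if len(pairs) >= min_atoms:
            molecules.append({"title": label,
                              "atoms": [atom for _, atom in pairs]})
        else:
            leftover.extend(raw for raw, _ in pairs)

    for raw in text.splitlines():
        match = ATOM_LINE.match(raw)
        if match:
            symbol, x, y, z = match.groups()
            run.append((raw, (symbol, float(x), float(y), float(z))))
            continue
        flush(run, title)
        run = []
        if raw.strip():
            leftover.append(raw)
            title = raw.strip()
    flush(run, title)
    return molecules, leftover
```

One limitation of this sketch is that any non-matching line, such as a page number emitted at a page break, ends the current block, so a molecule that spans two pages would be split in two unless such lines are filtered out first.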

https://static-content.springer.com/image/art%3A10.1186%2Fs13321-016-0175-x/MediaObjects/13321_2016_175_Fig1_HTML.gif
Fig. 1

Supplementary data of a journal article (case study I) depicting the computed molecular data format. The contents of the highlighted text are required for re-computation of the data. A1, A2, B1, B2 refer to text patterns in the specific document. The crossed-out text in red is ignored when ChemEngine version 1.0 generates the coordinate file.
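To make the output side concrete, the following Python sketch turns a list of parsed atoms into Gaussian-style input text (GJF) and a simple 0/1 bond matrix. How ChemEngine computes its bond matrix is not described in this section, so the distance criterion used here (sum of covalent radii plus a tolerance) is a commonly used heuristic swapped in for illustration. The function names, the placeholder route line `# HF/3-21G`, the default charge and multiplicity, the approximate radii and the 0.4 Å tolerance are all assumptions.

```python
import math

# Approximate single-bond covalent radii in angstroms (illustrative subset).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "F": 0.57,
                   "P": 1.07, "S": 1.05, "Cl": 1.02, "Br": 1.20}


def to_gjf(title, atoms, route="# HF/3-21G", charge=0, multiplicity=1):
    """Format atoms as Gaussian input text.

    Layout: route line, blank line, title, blank line, charge and
    multiplicity, one line per atom, closing blank line. The route,
    charge and multiplicity defaults are placeholders, not values
    recovered from the PDF.
    """
    lines = [route, "", title, "", f"{charge} {multiplicity}"]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol:<2s} {x:14.8f} {y:14.8f} {z:14.8f}")
    lines.append("")  # blank line that terminates the geometry section
    return "\n".join(lines) + "\n"


def bond_matrix(atoms, tolerance=0.4):
    """Return a symmetric 0/1 matrix; 1 marks atom pairs close enough to bond.

    Two atoms count as bonded when their distance does not exceed the sum
    of their covalent radii plus `tolerance`. Elements missing from
    COVALENT_RADIUS raise KeyError rather than being guessed.
    """
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            symbol_i, *pos_i = atoms[i]
            symbol_j, *pos_j = atoms[j]
            limit = COVALENT_RADIUS[symbol_i] + COVALENT_RADIUS[symbol_j] + tolerance
            if math.dist(pos_i, pos_j) <= limit:
                matrix[i][j] = matrix[j][i] = 1
    return matrix
```

A distance-based matrix of this kind records connectivity only; bond orders, which an SDF file can also carry, would need a further perception step that is outside the scope of this sketch.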