S2O – A software tool for integrating research data from general purpose statistic software into electronic data capture systems

Electronic data collection is a major advance in the conduct of clinical trials compared to paper-based documentation [1]. Data capture for observational studies or registries is often performed in spreadsheet-based applications like Microsoft Excel or directly in statistical software like IBM SPSS [2-5]. In either case, data is transferred into statistical software, such as SAS [6], R [7] or IBM SPSS Statistics [8], for analysis. Applications like Excel or SPSS are commonly used in academic research institutions: they are easy to use, relatively cheap and provide flexible data structures (variables can be added and removed as needed). In contrast, electronic data capture (EDC) systems are used to collect and manage data for interventional trials in a regulated setting.

In the following, we refer to data collection tools that are based on spreadsheets, like Excel or SPSS, as SBDC (spreadsheet-based data collection) software, whereas EDC systems are understood as applications for the conduct of clinical trials. EDC systems must comply with the requirements of regulatory authorities such as the Food and Drug Administration (FDA) [9] or the European Medicines Agency (EMA) [10]. In contrast to SBDC systems, EDC software is usually used as a remote data entry (RDE) system.

SBDC applications can save setup and training time, especially for smaller studies, but this kind of data capture suffers from several drawbacks. Documents are often stored locally or on a network share, which hinders shared access and simultaneous work. A further disadvantage is the lack of data security in terms of rights- and role-based access control. Backups of SBDC databases are commonly performed manually by copying files to external storage, which may result in version conflicts, especially when multiple researchers are involved. Usually, SBDC software does not support the workflow of clinical trials, e.g. event calendars, which are critical for longitudinal study designs. Missing traceability of entered data is also a major concern: no change log is available, i.e. it cannot be audited who changed which data, when, or why.

In contrast to SBDC applications, data collection with EDC systems can be managed for multiple users and sites. Central hosting with access via the Internet enables trustworthy backups of the latest data including its change history [11]. Access rights and roles can be managed centrally. Due to regulatory requirements, EDC systems for interventional trials must undergo a validation process according to regulations for electronic data capture in clinical trials [12], such as Good Clinical Practice (GCP) [13] or FDA 21 CFR Part 11 [14]. In contrast to SBDC applications, EDC software is capable of complying with these regulations and is designed to support an organized workflow from the creation of forms and the management of queries to the closure of the database.

Nevertheless, the interoperability of commercial and open-source EDC applications varies. Almost all systems can export data as a spreadsheet file for transfer into statistical software. In addition, many systems can import clinical values, for instance from central laboratories. The Operational Data Model (ODM) from the Clinical Data Interchange Standards Consortium (CDISC) is a commonly supported transport format for EDC systems [15]. ODM is a format for defining the electronic case report form (eCRF) and for communicating and archiving metadata as well as patient data in clinical trials [12, 16]. Of note, it is capable of storing a complete audit trail of captured data. Commercial and academic EDC solutions like x4T-EDC [17] are able to create the trial database directly from the imported ODM data structure.
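To illustrate how ODM combines metadata and patient data in one document, the following minimal Python sketch builds a small ODM-like XML tree containing one item definition with a code list and one captured value. All OIDs, names and values are hypothetical, the referenced event, form and item group definitions are omitted, and the result is not validated against the ODM schema; it is meant only to show the general shape of the format.

```python
import xml.etree.ElementTree as ET

# Namespace of ODM version 1.3; registered as default so output has no prefixes.
ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"
ET.register_namespace("", ODM_NS)


def q(tag):
    """Return the namespace-qualified tag name for an ODM element."""
    return f"{{{ODM_NS}}}{tag}"


def build_odm_sketch():
    """Build an illustrative ODM tree (hypothetical OIDs, not schema-validated)."""
    odm = ET.Element(q("ODM"), FileType="Snapshot", FileOID="example.odm.1",
                     CreationDateTime="2015-01-01T00:00:00")

    # Metadata: defines the variable "sex" and its coded values.
    study = ET.SubElement(odm, q("Study"), OID="ST.EXAMPLE")
    mdv = ET.SubElement(study, q("MetaDataVersion"), OID="MDV.1",
                        Name="Example metadata")
    item = ET.SubElement(mdv, q("ItemDef"), OID="I.SEX", Name="sex",
                         DataType="integer")
    question = ET.SubElement(item, q("Question"))
    ET.SubElement(question, q("TranslatedText")).text = "Sex of the patient"
    ET.SubElement(item, q("CodeListRef"), CodeListOID="CL.SEX")

    code_list = ET.SubElement(mdv, q("CodeList"), OID="CL.SEX", Name="sex",
                              DataType="integer")
    for value, label in (("1", "male"), ("2", "female")):
        entry = ET.SubElement(code_list, q("CodeListItem"), CodedValue=value)
        decode = ET.SubElement(entry, q("Decode"))
        ET.SubElement(decode, q("TranslatedText")).text = label

    # Patient data: one value for one subject, nested event > form > item group.
    clinical = ET.SubElement(odm, q("ClinicalData"), StudyOID="ST.EXAMPLE",
                             MetaDataVersionOID="MDV.1")
    subject = ET.SubElement(clinical, q("SubjectData"), SubjectKey="P001")
    event = ET.SubElement(subject, q("StudyEventData"), StudyEventOID="SE.1")
    form = ET.SubElement(event, q("FormData"), FormOID="F.1")
    group = ET.SubElement(form, q("ItemGroupData"), ItemGroupOID="IG.1")
    ET.SubElement(group, q("ItemData"), ItemOID="I.SEX", Value="2")
    return odm


if __name__ == "__main__":
    print(ET.tostring(build_odm_sketch(), encoding="unicode"))
```

The same item OID links the definition in the metadata section to the captured value in the clinical data section, which is what allows an EDC system to rebuild both the form structure and its content from a single file.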

Preliminary or pilot studies are often conducted before large-scale clinical trials. When these pilot studies are successful, data collection needs to be upgraded to meet the requirements of multi-user and multi-center trials, in particular regulatory compliance, scalability and technical security. EDC systems are the method of choice for remote data entry by multiple users and institutions. At present, however, the change towards an EDC system implies a completely new setup of the study database structure, which is a labor-intensive and error-prone manual process.

To our knowledge, no transformation approach or tool exists that supports the conversion and exchange of such research databases. Therefore, the aim of our software tool S2O is to convert between the SPSS and CDISC ODM formats in order to facilitate the transfer from SBDC to EDC systems, including data transformation. The second goal is to evaluate the conversion process with regard to syntactic and semantic correctness and its limitations.
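One step that any such conversion has to address is the mapping of SPSS variable formats to ODM data types. The following Python sketch shows one conceivable mapping; the function name, the selection of formats and the fallback to text are assumptions made for illustration and do not describe the mapping actually implemented in S2O.

```python
import re

# Hypothetical mapping from SPSS format families to ODM data types.
# Only a few common formats are covered; this is not the S2O mapping.
DATE_FORMATS = {"DATE", "ADATE", "EDATE", "SDATE"}
FORMAT_PATTERN = re.compile(r"^([A-Z]+)(\d+)?(?:\.(\d+))?$")


def spss_format_to_odm_datatype(spss_format):
    """Map an SPSS format string such as 'F8.2' or 'A20' to an ODM data type."""
    match = FORMAT_PATTERN.match(spss_format.strip().upper())
    if match is None:
        return "text"  # assumed fallback for unrecognized formats
    family, _width, decimals = match.groups()
    if family == "A":
        return "text"
    if family == "F":
        # Numeric format: no decimal places is treated as integer here.
        return "float" if decimals and int(decimals) > 0 else "integer"
    if family in DATE_FORMATS:
        return "date"
    if family == "DATETIME":
        return "datetime"
    if family == "TIME":
        return "time"
    return "text"


if __name__ == "__main__":
    for fmt in ("F8.2", "F3", "A20", "DATE11", "DATETIME20"):
        print(fmt, "->", spss_format_to_odm_datatype(fmt))
```

A mapping of this kind only covers the syntactic side of the conversion. Whether the meaning of a variable survives the transfer, for example through its value labels, is a separate question of semantic correctness.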