Bioboxes: standardised containers for interchangeable bioinformatics software


The Docker platform [5] allows the creation of lightweight containers in which developers can install their
software along with all required libraries and scripts. These containers can then
be easily shared through a central repository, or as compressed files, and used in
the same way as if the software itself were installed. The bioinformatics field has
quickly recognised the opportunity provided by Docker [6], [7]: containers do not dictate a specific software framework or language for
implementing bioinformatics tools, and they allow integration with existing software.

Containerisation further has the potential to solve the problems of software availability
and installation outlined above, because bundling all dependencies removes the need
for the user to compile and install anything (except Docker itself). Software containers
also provide researchers with the option to reproduce existing published results so
as to replicate or expand on the work of others. Examples of this are the nucleotid.es
[8] and Critical Assessment of Metagenomic Interpretation (CAMI) [9] projects, where the benchmarked tools are containerised and available for download
by users.

Even with these outlined advantages, without standardisation bioinformatics will continue
to suffer from mismatching interfaces between tools in software pipelines. The time-consuming
job of maintaining these pipelines then falls to the bioinformatician, reducing their
role from computational researcher to custodian of the code that glues different tools together.

To this end we, developers involved in both CAMI and nucleotid.es, have created the
bioboxes project [10] with the aim of specifying standardised bioinformatics containers. A biobox is a
software container with a standardised interface that describes what kind of input
files and parameters are accepted and which output files are to be returned. An example
is a short-read assembler that takes an input paired-FASTQ file and returns a contig
FASTA file. Each developer creating a biobox should make sure the container accepts
these inputs and returns the expected outputs.
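
The idea of an interface that names the accepted inputs and the expected outputs can be illustrated with a minimal Python sketch. This is not the bioboxes specification itself: the `Interface` class, the `problems` helper, and the type labels `paired_fastq` and `contig_fasta` are hypothetical names chosen only to mirror the short-read assembler example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    """Hypothetical description of what one kind of biobox accepts and returns."""
    task: str
    inputs: frozenset   # labels of the required input file types
    outputs: frozenset  # labels of the output file types to be returned


# The short-read assembler example from the text; the labels are assumptions.
SHORT_READ_ASSEMBLER = Interface(
    task="short_read_assembler",
    inputs=frozenset({"paired_fastq"}),
    outputs=frozenset({"contig_fasta"}),
)


def problems(interface, given_inputs, produced_outputs):
    """List every way a container run deviates from the interface."""
    found = []
    for name in sorted(interface.inputs - set(given_inputs)):
        found.append(f"missing input: {name}")
    for name in sorted(set(given_inputs) - interface.inputs):
        found.append(f"unexpected input: {name}")
    for name in sorted(interface.outputs - set(produced_outputs)):
        found.append(f"missing output: {name}")
    return found


# A conforming run reports no problems; a non-conforming one reports each deviation.
ok = problems(SHORT_READ_ASSEMBLER, ["paired_fastq"], ["contig_fasta"])
bad = problems(SHORT_READ_ASSEMBLER, ["single_fastq"], [])
print(ok)
print(bad)
```

A check of this kind is what would let a pipeline author rely on the interface rather than on the details of any one assembler.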

Specifying the same interface for the same task allows one tool to be swapped for
another in a pipeline. This creates an interchangeable parts list for researchers,
which, combined with Docker containerisation, means that biologists and bioinformaticians
have access to, and can immediately use, a large body of bioinformatics software.
Figure 1 contrasts the existing state of bioinformatics software with standardised biobox
software containers. Box 1 shows a Python command-line interface to bioboxes that
allows the reader to test out a biobox.
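
The interchangeability described above can be sketched as follows. The sketch only builds `docker run` argument lists and does not execute them. The image names `example/assembler-a` and `example/assembler-b`, the host paths, the `/input` and `/output` mount points, and the `default` task argument are all assumptions made for illustration, not names taken from the bioboxes project.

```python
def docker_command(image, reads_dir, output_dir, task="default"):
    """Build a `docker run` argument list for a container with a shared interface.

    The mount points and task argument are hypothetical; what matters is that
    every image implementing the same interface is invoked identically.
    """
    return [
        "docker", "run", "--rm",
        "--volume", f"{reads_dir}:/input:ro",    # input reads, mounted read-only
        "--volume", f"{output_dir}:/output:rw",  # directory where outputs are written
        image, task,
    ]


# Two hypothetical assemblers that implement the same interface.
cmd_a = docker_command("example/assembler-a", "/data/reads", "/data/out")
cmd_b = docker_command("example/assembler-b", "/data/reads", "/data/out")

# Positions at which the two invocations differ.
differing = [i for i, (a, b) in enumerate(zip(cmd_a, cmd_b)) if a != b]
print(" ".join(cmd_a))
print(" ".join(cmd_b))
print(differing)
```

Under these assumptions, swapping one tool for another changes only the image name, and the rest of the pipeline step stays untouched.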

Fig. 1. Comparison of the current software situation in bioinformatics (top) with using biobox Docker containers with standardised interfaces (bottom)