MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression

Pre-Processing

MetaCRAM accepts both unpaired and paired-end reads. If paired-end reads are given
as input to MetaCRAM, the first preprocessing step is to append the read IDs
with a “_1” or a “_2”, indicating that the read came from the first or second mate,
respectively. Another preprocessing step is to filter out the quality scores
in case the input file is in FASTQ format. This filtering allows for
using new and emerging quality score compression methods without constantly updating
the MetaCRAM platform. Note that the paired-end labeling is done automatically, while
filtering can be implemented outside the integrated pipeline by the user, based on
their requirements for lossy or lossless quality score compression.
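A minimal sketch of the labeling step is shown below, assuming FASTA input; the file names are illustrative rather than part of MetaCRAM's interface.

```python
# Hedged sketch of the paired-end labeling step: read IDs from the two mate
# files are suffixed with "_1" / "_2" so the mates can be separated again
# after decompression. File names are illustrative.

def label_mates(fasta_path: str, mate: int, out_path: str) -> None:
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):                  # FASTA header line
                dst.write(line.rstrip("\n") + f"_{mate}\n")
            else:
                dst.write(line)

label_mates("reads_1.fasta", 1, "labeled_1.fasta")
label_mates("reads_2.fasta", 2, "labeled_2.fasta")
```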

By default, MetaCRAM uses FASTA files that do not contain quality values, in which
case the resulting SAM file contains the symbol “I” repeated as many times as the
length of the sequence. These symbols amount to about 100 bytes per read, and the
overhead increases proportionally to the number of reads (Table 2 of the “Results” section illustrates the amount of storage space that quality
scores occupy in each dataset, ranging from 153 MB to 3.4 GB). In order to reduce
the size of this unnecessary field, MetaCRAM replaces the run of “I”s with the
single symbol “*”, complying with the standard SAM format. Likewise, read IDs are
highly repetitive in nature: for instance, every read ID starts with the dataset name,
such as “SRR359032.”, followed by a unique read number. Rather than repeating the
dataset name for every read, we simply store it once and append it back during decompression.
Two versions of MetaCRAM are available to the user: one incorporating these two options,
and one without them. The former requires slightly longer compression and decompression times.
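The two reductions can be sketched as follows, assuming standard SAM columns and the dataset prefix “SRR359032.” used as the running example; this is an illustration rather than MetaCRAM's actual implementation.

```python
# Minimal sketch of the two SAM-size reductions described above: replace the
# run of 'I' quality values with "*" and strip the shared read-ID prefix
# (stored once, globally, and re-appended at decompression time).

def strip_sam_line(line: str, prefix: str = "SRR359032.") -> str:
    if line.startswith("@"):                    # header lines pass through
        return line
    fields = line.rstrip("\n").split("\t")
    fields[0] = fields[0].removeprefix(prefix)  # QNAME: drop repeated prefix
    fields[10] = "*"                            # QUAL: replace run of 'I's
    return "\t".join(fields) + "\n"
```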

Taxonomy identification

Given the labeled read sequences of a metagenomic sample, the first step is to identify
the mixture of species present in the sample. There are several taxonomy identification
methods currently in use: the authors of [31] propose to use the 16S rRNA regions for bacterial genome identification, while MetaPhyler
[32] scans for unique markers exceeding length 20 and provides taxonomy labels as specific
as the genus level. On the other hand, a newer taxonomy identification software tool known as Kraken
[15], based on exact alignment of k-mers to a database of known species, often outperforms MetaPhyler and other methods
both in terms of speed and discovery of true positives, as indicated by our tests.

MetaCRAM employs Kraken as the default tool in the pipeline. Kraken produces an output
report which is automatically processed by MetaCRAM. Part of the report contains information
about the species present in the sample, as well as their abundances. We rank the
species from the most abundant to the least abundant, where abundance is based
on the number of reads identified as matching a species in the database. For downstream
analysis, MetaCRAM selects the “most relevant” species and uses their genomes as references.
The default definition of “most relevant” is the top 75 species, but one has the option
to choose a threshold for the abundance value or for the number of references used.
As an illustration, Table 1 lists the results of an analysis of the impact of different thresholds on the processing
time and the compression ratio.
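As an illustrative sketch, the ranking and selection step could be implemented along the following lines; it assumes the standard tab-separated Kraken report layout (percentage, clade read count, direct read count, rank code, taxonomy ID, name) and the default threshold of the top 75 species.

```python
# Hedged sketch: rank species in a Kraken report by read count and keep the
# top-k as references. The column layout follows the standard Kraken report.

def top_species(report_path: str, k: int = 75):
    species = []
    with open(report_path) as fh:
        for line in fh:
            pct, clade_reads, direct_reads, rank, taxid, name = \
                line.rstrip("\n").split("\t")
            if rank == "S":                    # species-level entries only
                species.append((int(clade_reads), taxid, name.strip()))
    species.sort(reverse=True)                 # most abundant first
    return species[:k]
```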

Alignment and assembly

After a group of reference genomes is carefully chosen based on the Kraken software
output, alignment of reads to the reference genomes is performed. This task is accomplished
by using Bowtie2, a standard software tool for ultra-fast alignment of short reads
to long genomes. The alignment information is stored in a SAM (Sequence Alignment/Map)
file format and subsequently used for compression via reference-based algorithms.
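As a sketch, the alignment step can be scripted as follows; the index and file names are illustrative, while the flags (-f for FASTA input, -x for the index, -U for unpaired reads, -S for SAM output) are standard Bowtie2 options.

```python
# Hedged sketch: align reads to the chosen references with Bowtie2 via
# subprocess. The index "refs_index" and the file names are illustrative.
import subprocess

subprocess.run(
    ["bowtie2", "-f", "-x", "refs_index", "-U", "reads.fasta",
     "-S", "alignment.sam"],
    check=True,
)
```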

Because many species in a metagenomic sample have never been sequenced
before, some reads will not align to any of the references with high alignment
scores; we collectively refer to these as unaligned reads hereafter. In order to discover reference genomes for the unaligned
reads, we assemble them, in a relatively efficient although often time-consuming manner,
using a metagenomic assembler. Our metagenomic assembler of choice is IDBA-UD [18], given that in our tests it produced the largest number of contigs leading to new
reference identification. Alternatives to IDBA-UD include the Ray Meta software [33].

When the reads have high sequencing depth and large overlaps, the contigs produced by the assembler may be queried using BLAST to identify the organisms they
most likely originated from. The user may choose to BLAST only the n longest contigs, where n is a user-specified number, but in our analysis we use all contigs (which is also the default setting). Subsequently, we align the unaligned
reads to the newly found references.

In some rare cases, the assembler may fail, depending on the number of species in the
metagenomic sample and the sequencing depth, in which case one may want to skip the assembly
step and compress the unaligned reads in a reference-free manner. For reference-free
compression, the software of choice in MetaCRAM is MFCompress [14]. As long as the assembler is successful, one can reduce the volume of unaligned reads
by iterating the process of assembly, BLAST, and alignment, as illustrated in the top
right-hand corner of Fig. 1. All our demonstrations and results are based on two iterations of the described
algorithmic loop.

Distribution of read starting positions

We empirically studied the distribution of integers representing the read positions,
variation positions, and paired-end offsets in order to choose the most suitable compression
method. As an example, the distribution of the starting positions for the reads that
aligned to JH603150 (genome of Klebsiella oxytoca) in the dataset SRR359032 is shown
in Fig. 4. This distribution was truncated after achieving a 90 % coverage of the data (i.e.,
after only 10 % of the read start positions exceeded the depicted maximum length).
The empirical distribution is shown in yellow, while fitted power law distributions,
determined according to [22] and parameterized by the integer i to be encoded and the divisor m of the extended Golomb code, are also plotted; the parameter choices shown are m=3 and m=4. The negative binomial distribution is fitted using maximum likelihood estimation
(MLE), while the geometric distribution is fitted in two different ways: using MLE
and using ezfit, a MATLAB script which performs an unconstrained nonlinear minimization
of the sum of squared residuals with respect to the distribution parameters.
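For concreteness, the sketch below shows the two geometric fits (MLE and least-squares) on a list of non-negative integers; the input file name is illustrative, and scipy's minimizer stands in for the MATLAB ezfit script.

```python
# Hedged sketch of the two geometric fits mentioned above. For integers on
# {0, 1, 2, ...}, the maximum likelihood estimate of the geometric parameter
# is p = 1/(1 + mean); the least-squares fit minimizes squared residuals
# between the empirical and model probability mass functions.
import numpy as np
from scipy.optimize import minimize_scalar

data = np.loadtxt("positions_to_encode.txt", dtype=int)   # assumed input file

p_mle = 1.0 / (1.0 + data.mean())                          # closed-form MLE

values, counts = np.unique(data, return_counts=True)
empirical = counts / counts.sum()
model = lambda p: p * (1 - p) ** values                    # geometric pmf
res = minimize_scalar(lambda p: ((model(p) - empirical) ** 2).sum(),
                      bounds=(1e-9, 1 - 1e-9), method="bounded")
print(f"MLE: p = {p_mle:.4g}, least squares: p = {res.x:.4g}")
```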

Fig. 4. Integer Distribution. Distribution fitting of integers to be encoded, truncated at
90 % of the integer data

For single-reference alignment methods, it was reported that the best fit for the
empirical distribution is a geometric distribution or a negative binomial distribution
[34]. However, due to sequencing errors and the non-uniform distribution of hydrogen bond
breakage, the empirical data often deviates from geometric or negative binomial distributions
[35]. In addition, metagenomic samples involve multiple references, which may
produce good alignments for reads that did not originally correspond to the genome
of the reference; this creates additional changes in the read starting positions
with respect to the geometric distribution. Moreover, one has to encode not only the
read positions but also the variation positions and paired-end offsets, making it
difficult to claim that any one of the fitted distributions is better than the others. This
observation is supported by Fig. 4. Since there is no known efficient optimal encoding method for a set of integers with a negative binomial distribution,
while Golomb and extended Golomb encoding are optimal for geometric distributions and
power law distributions, respectively, we use these two methods with m=3. The parameter m was chosen based on extensive experiments, although the user has the freedom to adjust
and modify its value.

As the number of unaligned reads that remains after a few iterations of MetaCRAM is
relatively small, these reads are compressed using a reference-free tool,
MFCompress [14], which is based on finite-context models. Furthermore, the SAM files produced by
Bowtie2 are converted to the sorted and indexed binary BAM format
using SAMtools [36]. Each BAM file is compressed via reference-based compression against its representative
reference into the standard CRAM format. We tested three different modes of the CRAM toolkit [11]: Huffman, Golomb, and extended Golomb encoding, all of which are described in the
next section. Note that the extended Golomb encoding method is our new addition to
the classical CRAM method, as it appears to offer a good compromise between compression/decompression
speed and compression ratio.

Intrinsically, SAM files contain quality values and unique read IDs for each read,
which inevitably account for a large file size: quality values are character strings
as long as the sequence itself, and read IDs often repeat the name of the dataset. By default,
MetaCRAM preserves all quality values and read IDs, as designed in CRAM.

Compression

Compression in the reference-based mode is accomplished by compressing the starting
positions of the reads with respect to the reference genomes, along with the base differences
between the reads and the references. As both the starting positions and the bases belong to
a finite integer alphabet, we use three different integer compression methods, briefly
described below.

Huffman coding is a prefix-free, variable-length compression method for known distributions
[20] which is information-theoretically optimal [37]. The idea is to encode more frequent symbols with fewer bits than less frequent ones.
For example, given the alphabet A=(a,b,c,d,e) and the corresponding distribution P=(0.25,0.25,0.2,0.15,0.15), building a Huffman tree results in the codebook C=(00,10,11,010,011). Decoding relies on the Huffman tree constructed during encoding,
which is stored in an efficient manner, usually ordered according to the frequency
of the symbols. Due to the prefix-free property, Huffman coding is uniquely decodable
and does not require any special marker between words. Two drawbacks of Huffman
coding that make it a costly solution for genomic compression are its storage complexity,
since one needs to record large tree structures for the big alphabet sizes that arise when
encoding positions in long sequences, and the need to know the underlying distribution
a priori. Adaptive Huffman coding mitigates the second problem, at the cost of the increased computational
complexity associated with constructing multiple encoding trees [38]. In order to alleviate these computational challenges, we implemented so-called canonical Huffman encoding, which bypasses the problem of storing a large code tree by sequentially
encoding the lengths of the codes [39].
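To make the construction concrete, here is a minimal Python sketch of Huffman codebook construction for the example alphabet above. It is an illustration rather than the canonical Huffman variant used in MetaCRAM; the exact codewords depend on tie-breaking, but the codeword lengths (three 2-bit and two 3-bit) match the example.

```python
# Minimal Huffman codebook construction: repeatedly merge the two least
# probable subtrees, prefixing their codewords with "0" and "1".
import heapq
from itertools import count

def huffman_codebook(probs: dict) -> dict:
    tiebreak = count()  # makes heap entries comparable when probs are equal
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

book = huffman_codebook({"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15})
print(book)   # codeword lengths: three symbols get 2 bits, two get 3 bits
```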

Golomb codes are optimal prefix-free codes for countably infinite lists of non-negative
integers following a geometric distribution [21]. In Golomb coding, one encodes an integer n in two parts, using its quotient q and remainder r with respect to the divisor m. The quotient is encoded in unary, while the remainder is encoded via truncated binary encoding. Given a list of integers following a geometric distribution with known
mean μ, the divisor m can be optimized so as to reduce the code length. In [40], the optimal value of m was derived for m=2^k, for any integer k; the resulting encoding is known as the Golomb-Rice procedure. It proceeds as follows:
first, we write n=q×m+r, where q=⌊n/m⌋ and r=n−q×m. Unary coding represents an integer i by i ones followed by a single zero; for example, the integer i=4 in unary reads as 11110. Truncated binary encoding is a prefix-free code for an
alphabet of size m which is more efficient than standard binary encoding. Because the remainder r can only take values in {0,1,…,m−1}, truncated binary encoding
assigns to the first 2^(k+1)−m symbols codewords of fixed length k, and to the remaining symbols codewords of length k+1, where k=⌊log2(m)⌋. For instance, given n=7 and m=3, we have 7=2×3+1, implying q=2 and r=1. Encoding 2 in unary gives 110, and 1 in truncated binary reads as 10. Hence, the
codeword used to encode the initial integer is the concatenation of the two representations,
namely 11010.

Decoding of Golomb codewords is similarly decoupled into decoding of the quotient
and the remainder. Given a codeword, the number of ones before the first zero determines
the quotient q, while the remaining k or k+1 bits represent the remainder r according to truncated binary decoding for an alphabet of size m. The integer n is then obtained as n=q×m+r.
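To make the procedure concrete, the following sketch implements the Golomb-Rice encoder and decoder exactly as described above; it is an illustrative implementation rather than the CRAM toolkit's own code.

```python
# Illustrative Golomb encoder/decoder: unary-coded quotient followed by a
# truncated binary remainder, reproducing the n=7, m=3 example above.

def truncated_binary(r: int, m: int) -> str:
    """Truncated binary code for r in {0, ..., m-1}."""
    k = m.bit_length() - 1              # k = floor(log2(m))
    u = (1 << (k + 1)) - m              # number of short, k-bit codewords
    if r < u:
        return format(r, "b").zfill(k)
    return format(r + u, "b").zfill(k + 1)

def golomb_encode(n: int, m: int) -> str:
    q, r = divmod(n, m)
    return "1" * q + "0" + truncated_binary(r, m)   # unary(q) + trunc(r)

def golomb_decode(code: str, m: int) -> int:
    q = code.index("0")                 # ones before the first zero
    pos = q + 1
    k = m.bit_length() - 1
    u = (1 << (k + 1)) - m
    r = int(code[pos:pos + k], 2) if k > 0 else 0
    if r >= u:                          # long codeword: read one extra bit
        r = int(code[pos:pos + k + 1], 2) - u
    return q * m + r

print(golomb_encode(7, 3))              # -> '11010', as in the example
print(golomb_decode("11010", 3))        # -> 7
```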

Golomb encoding has one advantage over Huffman coding in so far as it is computationally
efficient (as it only requires division operations). One does not need to know the distribution
a priori, although there are clearly no guarantees that Golomb coding for an unknown distribution
will be even near-optimal: Golomb encoding is optimal only for integers following
a geometric distribution.

An extension of Golomb encoding, termed extended Golomb coding [22], is an iterative method for encoding non-negative integers following a power
law distribution. One divides an integer n by m until the quotient becomes 0, and then encodes the number of iterations M in unary, along with an array of remainders r according to an encoding table. This method has an advantage over Golomb coding when
encoding large integers, as is the case for read position compression. As an example,
consider the integer n=1000: with m=2, Golomb coding would produce q=500 and r=0, and unary encoding of 500 requires 501 bits, whereas with extended Golomb coding the
number of iterations equals M=9 and its unary encoding requires only 10 bits. As an illustration, let us encode n=7 given m=3. In the first iteration, 7=2×3+1, so r1=1 is encoded as 10, and q1=2. Since the quotient is not 0, we iterate the process: 2=0×3+2 implies r2=2, which is encoded as 1, and q2=0. Because the quotient at this step is 0, we encode M=2 as 110 and r=r2r1=110, and our codeword is 110110.
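The following Python sketch implements this encoder. The remainder code table is an assumption for m=3 chosen to match the worked example (1 encoded as 10, 2 encoded as 1); the actual tables are specified in reference [22].

```python
# Illustrative extended Golomb encoder: divide n by m until the quotient is
# 0, emit the iteration count M in unary, then the remainders r_M ... r_1.
# REMAINDER_CODE_M3 is an assumed table for m=3 matching the example above.
REMAINDER_CODE_M3 = {0: "0", 1: "10", 2: "1"}

def unary(i: int) -> str:
    """Unary code: i ones followed by a single zero (4 -> '11110')."""
    return "1" * i + "0"

def extended_golomb_encode(n: int, m: int = 3) -> str:
    remainders = []
    while True:
        n, r = divmod(n, m)
        remainders.append(r)
        if n == 0:
            break
    M = len(remainders)                     # number of division iterations
    body = "".join(REMAINDER_CODE_M3[r] for r in reversed(remainders))
    return unary(M) + body                  # M in unary, then r_M ... r_1

print(extended_golomb_encode(7, 3))         # -> '110110', as in the text
```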

The decoding of an extended Golomb code is also performed in M iterations. Since a remainder is stored for each iteration and the last quotient
equals qM=0, it is possible to reconstruct the original integer. Similar to Golomb coding,
extended Golomb encoding is computationally efficient, but optimal only for integers
following a power law distribution.

There are various other methods for integer encoding, such as Elias Gamma and Delta
encoding [41], which are not pursued in this paper as they do not appear to offer
good performance for the empirical distributions observed in our read position encoding
experiments.

Products

The compressed unaligned reads, the CRAM files, the list of reference genomes (optional),
the alignment rate (optional), and the contig files (optional) are all packaged into an archive.
The resulting archive can be stored in a distributed manner and, when desired, the
reads can be losslessly reconstructed via the CRAM toolkit. Additional file 3 contains software instructions, and detailed descriptions of the files and folders
created by MetaCRAM processing are available in Additional file 4.

Decompression

Lossless reconstruction of the reads from the compressed archive is done in two steps.
Reads with known references, stored in CRAM format, are decompressed with
the appropriate integer decompression algorithm. When the files are converted back
into the SAM format, we retrieve only the two fields necessary for the FASTA format, i.e.,
the read IDs and the sequences, printed on separate lines. Unaligned reads are decompressed
separately, through the decoding methods used in MFCompress.
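As a sketch of the SAM-to-FASTA step (assuming standard SAM columns), the two retained fields can be extracted as follows.

```python
# Hedged sketch: keep only the read ID and sequence fields of each SAM
# record (QNAME is column 1, SEQ is column 10) and emit them as FASTA.

def sam_to_fasta(sam_path: str, fasta_path: str) -> None:
    with open(sam_path) as sam, open(fasta_path, "w") as fa:
        for line in sam:
            if line.startswith("@"):            # skip SAM header lines
                continue
            fields = line.split("\t")
            fa.write(f">{fields[0]}\n{fields[9]}\n")   # QNAME and SEQ
```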

Post-processing

The two sets of reads are then combined into one file and sorted by read ID
in ascending order. If the reads were paired-end, they are separated into
two files according to the mate “flag” assigned in the pre-processing step.
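A minimal sketch of this step, assuming read IDs of the form “SRR359032.<number>_<mate>” as produced during pre-processing:

```python
# Hedged sketch of post-processing: sort merged reads by their numeric read
# number, then split paired-end reads by the "_1"/"_2" mate suffix.

def sort_and_split(reads):
    """reads: list of (read_id, sequence) pairs from both decompressed parts."""
    reads.sort(key=lambda r: int(r[0].rsplit("_", 1)[0].split(".")[-1]))
    mate1 = [r for r in reads if r[0].endswith("_1")]
    mate2 = [r for r in reads if r[0].endswith("_2")]
    return mate1, mate2
```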

Effects of parallelization

One key innovation in the implementation of MetaCRAM is the parallelization of the process,
inspired by the parallel single-genome assembly used in TIGER [42]. Given that metagenomic assembly is computationally highly demanding, and in order
to fully utilize the computing power of a standard desktop, MetaCRAM performs meta-assembly
of the unaligned reads and compression of the aligned reads in parallel. As shown
in Table 6, parallelization improves real, user, and system time by 23–40 %.

Table 6. Processing time improvements for two rounds of MetaCRAM on the SRR359032 dataset (5.4
GB, without removing redundancy in description lines) resulting from parallelization
of assembly and compression