QoRTs: a comprehensive toolset for quality control and data processing of RNA-Seq experiments

The QoRTs software package consists of two distinct modules: a java package which
performs most of the data processing and a companion R package for visualization and
cross-replicate comparison. A recommended analysis pipeline is illustrated in Fig.Â 1.

Fig. 1. An example analysis pipeline with QoRTs. This flowchart illustrates the recommended
analysis pipeline for conventional RNA-Seq analysis using QoRTs. Input and intermediary
files are shown in blue, output files and results are shown in purple

All count files, QC statistics, and browser tracks for a given replicate can be generated
using a single command and over a single pass through the alignment file, greatly
streamlining the analysis pipeline. If desired, individual sub-functions can be deactivated
to reduce runtime.

QoRTs is both fast and efficient: it can generate a comprehensive array of quality
control metrics, browser tracks, summary plots, and read counts in 3â€“6 min per million
read-pairs. For typical genomes and annotations the QoRTs data processing utility
requires less than 4 gigabytes of free memory. The companion R-package (used for generating
plots and pdf reports) has much lower resource requirements and can generally run
on any desktop computer that can support R.

The java package was written in the Scala programming language and uses the Picard
sam-jdk API 12]. However, since all necessary libraries are compiled to java bytecode and packaged
in the distribution jar file, neither Scala nor Picard is required for use. QoRTs
is designed to run on any machine that has both java (version 6 or higher, 64-bit)
and R (3.0.2 or higher), without any additional dependencies.

The importance of quality control

Quality control in bioinformatics is a contentious issue, and the necessity and utility
of quality control metrics is often called into question. However, across the field
of bioinformatics there are numerous cases where biases, artifacts, and other data
quality issues have called results into question, sometimes resulting in retractions
13]â€“19]. In many of these cases the problems were only identified when the study came under
intense external scrutiny, and the specific issues at fault were not well-characterized
up to that point. Such data-quality issues can sometimes be corrected, but only after
they have been identified 20]. Thus: it is not sufficient to check for issues that are already well-known: quality
control must be proactive and comprehensive.

RNA-Seq data in particular has numerous inherent sources of bias including hexamer
bias, 3â€™ bias, GC bias, amplification bias, mapping bias, sequence-specific bias,
and fragment-size bias 5], 6], 21]. While most advanced RNA-Seq analysis tools are designed with (at least some of)
these effects in mind, they often still rely on the assumption these effects are consistent
between samples and uniform between experimental conditions 2], 22]â€“24]. Outliers, batch effects, and/or effects that vary disproportionately between the
experimental conditions can still have the potential to drive false associations.

Without proactive and comprehensive quality control it is not possible to be certain
that unobserved errors, biases, or artifacts do not violate the assumptions of downstream
analyses.

Quality control with QoRTs

Performing quality control with QoRTs requires two steps. First the (java-based) data-processing
module is run on each replicate, and then the companion R package is used for visualization
and cross-comparison of replicates.

Simple multi-replicate plots that differentiate each replicate individually (as offered
in a limited capacity by RSeQC and RNA-SeQC) may be adequate for small sample sizes;
however, with larger or more complex studies these plots may be unreadable due to
multi-plotting and insufficiently distinct coloration. QoRTs offers the ability to
organize and differentiate replicate groups by sample, sequencer-lane/run, or any
arbitrary grouping assigned by the user (such as biological condition). This allows
easier identification of systematic biases and artifacts in large-scale datasets.
By default QoRTs produces a battery of 34 plots, which are each described at length
in the package user manual (Additional file 1) 25]. Fig.Â 2 includes a subset of these plots generated for a small example dataset of 72 replicates
(12 samples, 6 technical replicates each). In this example, replicates are colored
and differentiated by biological condition. The standard battery of QC plots can be
automatically compiled into a single multi-frame image or as a printable pdf report.

Fig. 2. A small selection of the QC plots offered by QoRTs. This series includes 12 samples,
each consisting of 6 technical replicates (for a total of 72 bam files), with 4 different
biological conditions (3 samples per condition). In all nine plots, replicates are
colored and differentiated by biological group. In the line plots (c,d,e, and f) the
samples are simply colored by biological group. In other plots (a and g), replicates
are differentiated by character, color, and horizontal offset. This differentiation
allows easy identification of both outliers and systematic biases or errors associated
with the biological condition. Such systematic errors are of particular importance
as they could potentially drive false associations. A full description of each plot
and its interpretation can be found in the supplementary materials

The purpose of these various plots is to characterize the data in numerous ways, hopefully
revealing any artifacts, outliers, batch effects, or phenodata-associated effects.
In most cases any abnormalities should be revealed by multiple plots, and the various
metrics can assist in identifying the underlying causes and assessing whether downstream
analyses are likely to be adversely affected. The QoRTs user manual includes descriptions
of various potential issues and how they could be recognized and differentiated using
the available QC plots 25]. The user manual also includes an in-depth walkthrough of two examples in which QoRTs
was used to identify actionable quality control issues in a real-world dataset.

In one such example, a shift in the sequencer scanner at cycle 53 of read 2 resulted
in a small number of reads (less than 1Â %) being truncated (Fig.Â 3). Using the array of information provided by QoRTs we can not only identify the presence
of a QC issue, but also narrow down the root cause of the issue and predict its impact
on downstream analyses. In this example, the issue manifested as a large increase
in the rate of â€˜Nâ€™ bases beginning at this cycle and continuing to the end of the
read. Similarly, an abrupt increase in the alignment clipping rate was observed beginning
at this cycle. The fact that the issue was specific to one lane (see Fig.Â 3c and d), rather than being specific to any particular sample (see Fig.Â 3a and b) implied that the issue likely originated at the sequencing step rather than at sample
or library preparation. The fact that the alignment clipping rate jumped so dramatically
at cycle 53 indicated that the root cause was a massive increase in the â€˜Nâ€™ rate in
a small subset of the reads, rather than being a more subtle increase distributed
across all reads.

Fig. 3. Example issue detected via QoRTs. A subset of the output plots from a dataset in which
a rare hardware-level fault produced an actionable QC issue that can be easily identified
via QoRTs. In (a) and (b) the replicates are colored by biological sample; in (c) and (d) replicates are colored by sequencer lane. See the QoRTs vignette for more information
(Additional file 1)

For most datasets these plots should not reveal anything of interest: RNA-Seq is a
relatively mature technology and large-scale systematic errors should (theoretically)
be rare. However, when such errors do occur it is critical that they be caught before
the flawed data is analyzed and the results reported.

Data processing for downstream analysis

In addition to its primary function as a quality control tool, QoRTs automatically
generates all input read-count files needed for use with a number of differential
expression/regulation analysis tools. Gene-level read counts are generated using the
same methodology specified by HTSeq and reproduced in the Bioconductor GenomicRanges
package (using the default â€œunionâ€ rule) 26], 27]. QoRTs also generates the exon-level counts and related annotation files required
by DEXSeq 22].

QoRTs can also (optionally) produce a number of browser track files designed for use
with the UCSC genome browser or the IGV viewer 28]â€“30]. QoRTs produces â€œwiggleâ€ files which can be used to view simple coverage depth across
evenly-spaced windows across the genome (similar to those produced by the samtools
â€œbam2wigâ€ utility) and specialized â€œbedâ€ files which display coverage depth bridging
any known or novel splice junctions, providing functionality similar to the â€œsashimiâ€
plots generated by IGV 30], 31]. QoRTs also provides tools for generating summary tracks that display mean normalized
coverages across multiple samples.

Comparison with existing tools

QoRTs offers and improves upon many of the features offered by the two other major
RNA quality control tools: RSeQC and RNA-SeQC (see TableÂ 1).

Table 1. Features and capabilities of QoRTs compared with those offered by other tools

The RNA-SeQC software package lacks many vital quality control metrics 8]. It does not calculate nucleotide-by-cycle, â€œNâ€-rate by cycle, insert size distribution,
clipping profile, cigar profile, or any splice-junction-related statistics. While
it may be sufficient for some purposes, the absence of these critical QC statistics
may allow biases, artifacts, or errors to go undetected.

The RSeQC software package, which ostensibly features a number of the functions implemented
in QoRTs, possesses numerous systematic bugs and flaws that cause it to consistently
produce erroneous and/or misleading results across several critical QC metrics 7]. For the purposes of internal testing we generated a variety of simple simulated
SAM alignment files, each containing up to a dozen ten-base-pair reads. Both QoRTs
(version 0.2.5, released March 5
^th
, 2015) and RSeQC (version 2.6.1, current as of March 5
^th
, 2015) were run on these example reads. Much of the resultant QC data generated by
RSeQC was found to be inaccurate. Documentation of a subset of these inconsistencies
is provided in the supplementary materials (see Additional file 2). Many of these inaccuracies could potentially serve to obfuscate real quality control
issues or falsely suggest the presence of nonexistent issues. The fact that such numerous
and fundamental errors remain present in a fully mature two-year-old software tool
demonstrates that RSeQC has not been subject to sufficient testing.

In addition, both RSeQC and RNA-SeQC only provide very limited tools for visual cross-comparison
between replicates. The few cross-comparison plots that are available simply plot
all replicates over the same plotting area, each in a different color. QoRTs can generate
plots that contrast and differentiate groups of replicates, allowing easy identification
of systematic biases or errors.