Investigation into the annotation of protocol sequencing steps in the sequence read archive

Bias and protocols in next-generation sequencing

The introduction of next-generation sequencing technologies has transformed the fields
of genomics and transcriptomics [1, 2]. Much of the raw sequence read data are deposited in public repositories such
as the Sequence Read Archive (SRA) [3], Gene Expression Omnibus (GEO) [4] and ArrayExpress [5]. To produce and sequence genomic and transcriptomic data for deposition in a public
data repository such as GEO or SRA, a sample undergoes an intricate series of chemical
reactions, and the resulting data are then processed before being deposited.

Prior to being sequenced, a nucleic acid sample will undergo a number of steps: sample
preparation, nucleic acid extraction, chemical modification (blunting, phosphorylation,
ligation of instrument specific synthetic chemical sequence adapters) and chemical
amplification. These steps are outlined in Figure 1. Sequencing itself is massively parallel: high-throughput next-generation
sequencing workflows can characterise millions to billions of reads in a matter of
days, generating large-scale data sets [1].

Figure 1. A typical next-generation sequencing workflow. The sequencing workflow is shown by the black arrows; red arrows depict the metadata
that should be captured from these sequencing workflow steps. We have focused on the first three
major steps.

An extensive body of literature exists within the bioinformatics community describing
the workflows used to analyse short-read data from next-generation sequencers: more
than 80 papers are listed in the review of Miller, Koren and Sutton (2010) on the
assembly of sequence read data alone [6]. In contrast, comparatively little work has been done to thoroughly determine
how the protocol steps prior to sequencing affect the final results [7]. As we will demonstrate in the section 'Next-generation sequencing protocol steps
prior to sequencing', some of these steps are known to be prone to introducing bias into the sequencing
data derived from the sample. This bias manifests as a deviation from the ideal uniform
distribution of reads [8] and is an important factor both in genome assembly (which requires sufficient reads
to form overlaps of sequences to assemble contigs) and, most likely, in expression
studies that rely on quantifying how strongly a sequence is expressed (transcribed) in a
sample [9]. In the following sections we discuss the various protocol steps and the sources of bias
they may introduce.
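To make the notion of deviation from uniformity concrete, the following sketch (our own illustrative code, not taken from any cited study) computes the coefficient of variation of per-window read counts, one simple summary of how non-uniform the read distribution is:

```python
import math

def coverage_uniformity(read_starts, genome_length, window=1000):
    """Coefficient of variation of per-window read counts:
    0.0 means perfectly even coverage; larger values mean more bias."""
    n_windows = math.ceil(genome_length / window)
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    mean = sum(counts) / n_windows
    if mean == 0:
        return 0.0
    var = sum((c - mean) ** 2 for c in counts) / n_windows
    return math.sqrt(var) / mean

# Evenly spread reads give CV 0; reads piled into one window give a high CV.
uniform = coverage_uniformity(range(0, 10000, 10), 10000)   # -> 0.0
skewed = coverage_uniformity([5] * 1000, 10000)             # -> 3.0
```

Real analyses would additionally correct for mappability and ploidy, but the same summary statistic underlies many of the bias measures discussed below.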

Bioinformatics studies seeking to characterise and model systematic errors are important,
and in the case of platform specific biases, such methods can be applied in the interim
before the technology is refined. To apply these methods to existing sequencing datasets,
adequate metadata are required in the repository databases from which they are sourced.

An example of a systematic error that has been characterised and modelled is base-call
error. According to Meacham et al. [9], these errors are a common feature wherever next-generation sequencing
technologies are used. Although these technologies have significantly reduced the
costs and increased the throughput of sequencing, they have been shown to
induce more errors than preceding technologies. Focusing on the Illumina platform,
with a view to demonstrating the potential impact on biological inferences, Meacham
and colleagues characterised systematic errors (positional and sequence specific)
that could be misinterpreted as heterozygous sites in individuals, or as SNPs,
in population analyses. They found that the majority of systematic errors occurred in sequences
preceded by a G, the most common context being GGT, where a T is substituted for a G. Dohm
et al. [10] also demonstrated that incorrect base calls are frequently preceded by base G, indicating
that the Illumina base-caller software has difficulty differentiating between GGG
and GGT.
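The kind of context analysis these studies perform can be sketched as a simple tally of substitutions keyed by the two preceding reference bases. The toy reference/read pair below is invented for illustration; a real analysis would work from alignments of millions of reads:

```python
# Sketch: tally substitution errors by the reference dinucleotide that
# precedes them, to test whether GG contexts (e.g. GGG read as GGT) are
# enriched. Illustrative only; assumes an indel-free alignment.
from collections import Counter

def errors_by_context(reference, read):
    """Count mismatches keyed by the preceding reference dinucleotide."""
    contexts = Counter()
    for i in range(2, min(len(reference), len(read))):
        if reference[i] != read[i]:
            contexts[reference[i - 2:i]] += 1
    return contexts

ref  = "ATGGGTACGGTCCGGGAA"
read = "ATGGTTACGGTCCGGTAA"   # two substitutions, both preceded by GG
print(errors_by_context(ref, read))   # Counter({'GG': 2})
```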

Similarly, Hansen et al. [11] investigated biases in RNA-seq data resulting from random hexamer priming, a method
used in library preparation of dsDNA samples from RNA to be sequenced on the Illumina
Genome Analyser. Their work demonstrated that random hexamer priming results in a non-uniform
distribution of reads, caused by a positional influence on nucleotide frequencies
up to the 13th nucleotide from the 5′ end of the reads. This positional
influence was reproduced across experiments, indicating a consistent
bias. An outcome of their work is a bias count reweighting scheme developed
to mitigate the impact of these biases.
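A much-simplified version of such a reweighting scheme can be sketched as follows. This is an illustration in the spirit of Hansen et al.'s approach, not their exact estimator; the fixed interior `offset` used as an unbiased background, and the use of plain hexamer counts, are our simplifying assumptions:

```python
from collections import Counter

def hexamer_weights(reads, offset=24):
    """Weight w(h) = background_freq(h) / start_freq(h) for each 5' hexamer h.
    Hexamers over-represented at read starts get weights below 1, so reads
    beginning with them are down-weighted. `offset` picks an interior
    position assumed to be free of priming bias."""
    usable = [r for r in reads if len(r) >= offset + 6]
    start = Counter(r[:6] for r in usable)
    background = Counter(r[offset:offset + 6] for r in usable)
    n, m = sum(start.values()), sum(background.values())
    return {h: (background.get(h, 0) / m) / (start[h] / n)
            for h in start if background.get(h, 0) > 0}
```

For example, a hexamer appearing in three of four read starts but only twice at the interior position receives a weight of (2/4)/(3/4) = 2/3.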

Finally, a relevant study on motifs that induce sequencing errors was undertaken by
Allhoff et al. [12], who described a statistical framework for identifying sequence-specific errors caused
by preceding bases, which they term context-specific errors (CSEs). Their
method involves pooling genomic positions and screening for strand biases, which
they demonstrate yields greater statistical power for identifying biases. Cheung
et al. [13] studied ChIP data from Illumina's Genome Analyser and found three types of systematic
sequencing errors: those caused by GC content, those related to the mappability of sequencing reads, and
regional biases that might be generated by local structure. They devised a normalisation
scheme that can be applied in downstream data analyses.

A thorough understanding of the protocols applied prior to sequencing could
enable analyses considerably more subtle than those described above, tailored to
the pipeline of protocol steps used. Furthermore, it would enable best-practice guidelines
for these protocols to come to the fore. In the following section we discuss the range
of potential biases introduced during some of the steps occurring prior to sequencing
and, while a thorough study is absent, we present evidence suggesting that some
of these biases exist.

Next-generation sequencing protocol steps prior to sequencing

The core workflow processes shared by next-generation sequencing technologies,
which involve protocol steps where biases may be introduced, are shown in Figure 1. The protocol steps carried out before sequencing fall into three classes:
DNA fragmentation, DNA ligation and DNA enrichment. It should be noted that various
protocols exist that do not require fragmentation of DNA prior to ligation, such
as PCR amplicon methods.

DNA fragmentation

Next-generation sequencing platforms require fragmentation of double-stranded DNA
(dsDNA) into sequence fragments (fragment templates or mate-pair templates) of an
appropriate size, as dictated by the read-length of the platform [14]. There are currently four methodologies in use for fractionating dsDNA:

Enzymatically (with restriction endonucleases);

Sonication [14];

Nebulisation [15];

Hydrodynamic shearing.

Bias induced during the fragmentation protocol step manifests as a broad distribution
of fragment lengths.

Enzymatic fragmentation employs type II restriction endonucleases to cleave dsDNA at (or in close proximity
to) short (3–8 bp) recognition sequence sites [16]. However, this method is known to introduce bias due to factors that can impair
recognition-site activity. Kamps-Hughes et al. [17] used Illumina high-throughput sequencing to assay the enzymatic activity of type
II restriction endonucleases. They examined cognate-site cleavage and non-specific,
non-cognate-site cleavage (referred to as star activity) of the restriction endonucleases
EcoRI and MfeI by mapping millions of site-flanked reads back to the Escherichia coli and Drosophila melanogaster genomes. Their study demonstrated that, despite the high sequence specificity these
enzymes exhibit, this specificity depends on a number of factors, such as enzyme
concentration, sequence context, buffer concentration and the nucleotides flanking the
cleavage site.
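For contrast with such non-cognate cleavage, an idealised in-silico digest at cognate sites only can be sketched as follows. EcoRI's GAATTC recognition site and its G^AATTC cut position are standard; the example sequence is invented, and real digests deviate from this ideal exactly when star activity occurs:

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut after `cut_offset` bases of each cognate site (EcoRI cuts
    G^AATTC) and return the resulting fragment lengths."""
    cuts, start = [], 0
    while True:
        i = sequence.find(site, start)
        if i == -1:
            break
        cuts.append(i + cut_offset)
        start = i + 1
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

print(digest("AAGAATTCTTTTGAATTCAA"))   # [3, 10, 7]
```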

A DNA sample may also be fractionated by sonication, a method in which the dsDNA is subjected to short periods of agitation by sound
energy, generating fragmented DNA through hydrodynamic shearing stresses [14]. Chromatin complexes of DNA and proteins have been shown to be refractory to shearing
by sonication, resulting in under-representation of the sequences affected
[18].

Fragmentation by nebulisation forces the DNA solution through a small hole. This produces a fine mist of aerosol
droplets containing fractionated dsDNA, in which fragment size is determined by the
viscosity of the DNA solution, the speed at which the solution is ejected, the pressure of
the gas and the temperature [15]. Hydrodynamic shearing is a method of DNA fragmentation that involves injecting the sample solution through
a narrow-diameter orifice at high velocity. The resulting shearing stresses cause the
DNA strands to break, yielding an approximately normally distributed
fragment size. Swartz and Farman investigated the effect of hydrodynamic shearing
on the sequencing of telomere-associated sequences [19]. They state that searches for telomeric sequences in fungal genomic databases typically
yield few results, and that hydrodynamic shearing may be a cause. They
found that sub-terminal regions of DNA are resistant to shearing, with breakages occurring
only at the next cleavable location relative to the terminal end of the fragments.
This results in an over-representation of terminal fragments but an under-representation
of telomeric regions: because all terminal fragments are cleaved at a similar location,
no contigs exist to connect the terminal and sub-terminal sequences.

DNA ligation

Blunting

Unwanted 5′ and 3′ overhangs are removed from the dsDNA to facilitate
the ligation of platform specific synthetic DNA sequence adapters to the fragments
– a process termed blunting. A number of enzymes can be utilised for this purpose,
such as Klenow DNA polymerase, T4 DNA polymerase and mung bean nuclease. The enzyme repairs the ends of the dsDNA fragments by ensuring that the
ends of the complementary strands are in line with each other. Such polymerase enzymes
possess 5′→3′ polymerisation activity and 3′→5′ exonuclease activity, but lack
5′→3′ exonuclease activity. The effect is that 3′ overhangs on dsDNA fragments are
removed by the 3′→5′ exonuclease activity. Because these enzymes lack 5′→3′ exonuclease activity,
5′ overhangs remain intact, and any complementary 3′ receding
strands are extended and brought in line with the 5′ overhung strand by the enzyme's
5′→3′ polymerase activity. This ensures both ends are "blunted", i.e. there is no
single-stranded DNA overhang. The fidelity of the polymerase enzymes used in this step
is variable: Klenow polymerase has been shown to have an average error rate (mutations
per base replicated) of 1.3 × 10⁻⁴ [18].

Phosphorylation

Since polymerase activity occurs in the 5′ to 3′ direction, it is necessary to phosphorylate
the 5′ ends of the blunted fragments. This can be carried out enzymatically using
T4 polynucleotide kinase (T4-PNK), which catalyses the transfer of the γ-phosphate of
ATP to the 5′ hydroxyl end. The efficiency of T4-PNK varies depending on the 5′ nucleotide,
and this can manifest as bias if a proportion of fragment ends remain unphosphorylated.
Differences in the binding interactions between T4-PNK and the kinase substrate cause
T4-PNK to prefer certain nucleotides at the first and
second sequence positions of the substrate, resulting in greater activity in the
phosphorylation of 5′ G than of 5′ C [20].

Attachment of an A overhang to dsDNA 3′ ends, or blunt-ended ligation

Synthetic sequencing adapters (such as those used on the Illumina platform) normally
possess a 5′ T overhang to facilitate their ligation to the fragments to be sequenced.
It follows that molecules in the sequencing fragment library must possess a complementary
3′ A overhang; a genetically modified Klenow exo-minus enzyme is usually used to achieve
this [21]. The enzyme possesses no exonuclease activity, but retains 5′→3′ polymerase activity.
This is used to catalyse the attachment of A overhangs to the 3′ ends of the sequencing
fragments so as to complement the 5′ T overhangs on the platform specific adapters. Alternatively,
blunt-ended ligation can be used to ligate sequence adapters to sequencing
fragments.

Library preparation methods utilising DNA ligase, which is able to ligate dsDNA fragments
bearing 3′ A overhangs to synthetic sequence adapters with 5′ T overhangs, have been
shown to be biased against sequences starting with T, as compared with blunt-ended ligation
[22].

In their study of Illumina sequencing bias and its implications for
characterising ancient DNA, Seguin-Orlando et al. [23] sequenced modern DNA in parallel under different ligation strategies. To
rule out shearing as a source of the bias, they sheared samples by different
methods while keeping the ligation strategy the same. Their results show that the bias against sequence
fragments with 5′ T was unlikely to be due to the shearing method; rather, it results
from the 3′ A to 5′ T overhang ligation method (a primary method used on Illumina
platforms). Furthermore, the bias correlates inversely with the concentration of sequence
adapters, which is normally kept low to prevent the adapters hybridising
with each other. They explain how the post-mortem degradation of ancient DNA (resulting
in deamination of C to U) generates misincorporation patterns in the sequencing library
that can be used to recognise and characterise true ancient DNA. These patterns can
also be altered by certain library construction protocol methods; Seguin-Orlando
and colleagues cited the Taq and Phusion polymerase enzymes, which are integral to Illumina sequencing protocols, as a cause
of this undesired modification during library preparation.
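The misincorporation profile just described can be sketched as a positional count of C→T mismatches towards the 5′ ends of reads, which is where post-mortem deamination is concentrated. The helper below and its toy alignments are our own illustration, not the authors' implementation:

```python
def ct_rate_by_position(pairs, n_positions=5):
    """For each 5' read position, fraction of reference Cs read as T.
    Assumes indel-free (reference, read) alignments. In genuine ancient
    DNA this rate rises sharply towards position 0."""
    ct = [0] * n_positions
    c_total = [0] * n_positions
    for ref, read in pairs:
        for i in range(min(n_positions, len(ref), len(read))):
            if ref[i] == "C":
                c_total[i] += 1
                if read[i] == "T":
                    ct[i] += 1
    return [ct[i] / c_total[i] if c_total[i] else 0.0
            for i in range(n_positions)]
```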

Adapter ligation

The ligation of synthetic dsDNA sequencing adapters (with a 5′ T overhang) to the fragment
dsDNA (5′ phosphorylated, with a 3′ A overhang) again requires the use of DNA ligase,
which is added in excess (at a 10:1 ratio) to ensure the attachment of
as many adapters as possible per unit time. Housby et al. [22] point out that most studies of the DNA replication process have centred on the fidelity
of DNA polymerase and the importance of understanding all the mechanisms that ensure
faithful copying of DNA sequences during replication.

DNA enrichment

In order to achieve sufficient quantities of sample for sequencing, an
enrichment process must be applied to the adapter-ligated fragment library. The polymerase
chain reaction (PCR) is a mainstay method of DNA enrichment [24]. It is well suited to enriching the fragment library since it replicates only those fragments
to which an adapter, encapsulating a PCR primer binding region, is attached; fragments
not ligated to adapters are not replicated, by virtue of lacking the
PCR primer site located on the ligated adapter. However, PCR amplification
may introduce bias in the form of a non-uniform distribution of reads, and it can introduce
artefacts into the library prepared for sequencing [25]. The significant variation in the fidelity of the polymerase enzymes on which PCR depends
has long been established [18]. Such artefacts have a number of origins: re-arrangement of the DNA resulting
in chimera formation, formation of hetero-duplex molecules, and DNA polymerase errors.
Further discussion of these is beyond the scope of this article; however, we direct
the reader to the study by Acinas et al. [25], who examined PCR-induced artefacts in sequencing library construction.

It has long been established that PCR is impaired by GC bias in the fragments to be
enriched [24–26]. Kozarewa et al. [24] demonstrated that a PCR-free enrichment step, relying solely
on cluster amplification on the sequencing platform to enrich the library,
resulted in a more uniform distribution of reads.
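One simple way to probe an existing data set for such GC bias is to correlate per-window GC fraction with per-window read counts; a strong correlation suggests amplification-related bias. The sketch below (window size and inputs illustrative) computes a Pearson correlation for this purpose:

```python
import math

def gc_bias_correlation(genome, read_starts, window=100):
    """Pearson correlation between per-window GC fraction and read count."""
    n = len(genome) // window
    gc = [sum(b in "GC" for b in genome[i * window:(i + 1) * window]) / window
          for i in range(n)]
    counts = [0] * n
    for pos in read_starts:
        if pos // window < n:
            counts[pos // window] += 1
    mg, mc = sum(gc) / n, sum(counts) / n
    cov = sum((g - mg) * (c - mc) for g, c in zip(gc, counts))
    sg = math.sqrt(sum((g - mg) ** 2 for g in gc))
    sc = math.sqrt(sum((c - mc) ** 2 for c in counts))
    return cov / (sg * sc) if sg and sc else 0.0
```

For instance, a toy genome whose GC-rich half attracts most of the reads yields a correlation near 1.0, whereas unbiased coverage gives a correlation near zero.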

Given the biases in polymerase activity [18, 27], a number of commercially produced, genetically modified polymerase enzymes have been
developed to confer greater fidelity. An investigation by Quail et al. [28] compared the fidelity of two commercially available polymerase enzymes, Kapa HiFi and Phusion polymerase, against PCR-free sequencing of four genomes of varying GC content. They
demonstrated variation between the two polymerases and found the profile of Kapa HiFi (as depicted by plots of normalised sequencing depth vs. % coverage) to be closer
to the profile seen with no amplification than that of Phusion polymerase.

DNA damage has also been shown to influence nucleotide incorporation and can introduce
bias depending on the preferences of the polymerase catalysing the reaction. The
nucleotide incorporation preferences of different polymerases can be investigated
using the modified nucleotide 8-oxo-7,8-dihydro-2′-deoxyguanosine (8-oxodG), which can
adopt either an anti-conformation (allowing normal Watson–Crick base pairing) or a syn-conformation (which undergoes Hoogsteen bonding) [29]. Sikorsky et al. investigated this and describe how the ratio of dCMP to dAMP insertion
opposite 8-oxodG depends on the class of polymerase. They found that both the fidelity
and the amplification efficiency of Taq DNA polymerase are susceptible to lesions on the fragment to be enriched.

Other enrichment methodologies can also be a source of bias; for example, multiple
displacement amplification (MDA) can result in preferential amplification of
certain sequences [30].

Annotation of publicly deposited data

We have described sequencing workflow steps that have the potential
to introduce bias into the resulting DNA sequencing data, focusing in particular on
sequence read DNA data and the relevant metadata. However, RNA sequencing data have
also been demonstrated to be prone to bias from protocol steps: random hexamer priming
[11] is one example, though other sources may well exist. Such biases warrant similar
study, but that is beyond the scope of this paper.

A number of standards exist for annotating genomic and transcriptomic
data sets, including Minimum Information About a Microarray Experiment (MIAME) [31] and Minimum Information about a Sequencing Experiment (MINSEQE) [32], as well as document markup formats such as Microarray Gene Expression Markup Language
(MAGE-ML). Ostensibly, appropriate use of these standards (notwithstanding the
absence of an agreed vocabulary) should ensure that one can disentangle the effects
of different biases. We note that the SRA [3], the focus of this paper, conforms to MINSEQE requirements.

In the remainder of this paper we explore the level of annotation of the preparatory
protocols in data publicly deposited in the SRA, one of the main repositories for next-generation
sequencing data. In particular, we describe the SRA metadata schema in detail,
along with the fields relevant to this study. Having constructed a list of keywords
relevant to the preparatory steps described above, we present their relative frequency
in the SRA metadata and the non-standard ways in which the annotation is expressed.
Finally, we discuss the implications of the low coverage of these keywords
and of the overall structure of metadata in the SRA, and how these inhibit a more systematic
study.
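The kind of keyword survey just described can be sketched as follows. The LIBRARY_CONSTRUCTION_PROTOCOL element does appear in the SRA experiment XML schema, but the keyword list, the example document and the single-document scope here are illustrative; a real survey would iterate over the full archive metadata:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative keyword list for the preparatory steps discussed above.
KEYWORDS = ["sonication", "nebulisation", "nebulization", "hydrodynamic",
            "restriction", "blunt", "phosphorylation", "ligation", "pcr"]

def keyword_frequencies(xml_text):
    """Count keyword occurrences in LIBRARY_CONSTRUCTION_PROTOCOL fields."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for node in root.iter("LIBRARY_CONSTRUCTION_PROTOCOL"):
        text = (node.text or "").lower()
        for kw in KEYWORDS:
            if kw in text:
                counts[kw] += 1
    return counts

example = """<EXPERIMENT_SET><EXPERIMENT><DESIGN>
  <LIBRARY_DESCRIPTOR>
    <LIBRARY_CONSTRUCTION_PROTOCOL>DNA was fragmented by sonication,
    blunt-ended, and enriched by PCR.</LIBRARY_CONSTRUCTION_PROTOCOL>
  </LIBRARY_DESCRIPTOR>
</DESIGN></EXPERIMENT></EXPERIMENT_SET>"""
print(keyword_frequencies(example))
```

As the later sections show, the difficulty in practice is not running such a scan but the fact that many deposited records leave this free-text field empty or describe the same step in inconsistent vocabulary.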