Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data

Genome-wide mouse hairpin bisulfite-sequencing data creation

We first generated genome-wide hairpin bisulfite-sequencing data for mouse ES cells
(E14TG2a) using the technology developed by our lab (GEO accession number GSE48229)
15]. To induce differentiation, the mouse ES cells were cultured in ES cell culturing
media for six days without LIF (Leukemia inhibitory factor). Genomic DNA isolated
from the undifferentiated (E14-d0) and differentiating (E14-d6) states of mES cells
were used for library construction. Briefly, after DNA sonication, end repair, and
dA tailing, the DNA fragments were further ligated to Biotin-modified hairpin adapter
and Illumina TruSeq adapters. Adapter-ligated DNA was digested with MseI and MluCI
(NEB) to enrich regions with high CpG density, and then pulled down using Dynabeads
^Â®MyOneTM Streptavidin C1 beads (Invitrogen). After bisulfite conversion and PCR, size
selection of 400-600 bp fragments was conducted to yield longer sequences that are
more amenable for unambiguous mapping to the reference sequence.

Unconverted sequence recovery

Figure 1 shows how the bisulfite treated hairpin PCR sequencing technology works and how the
original non-bisulfite sequence can be recovered. In step 0, the hairpin connector
is attached to opposing Watson and Crick strands. The opposing strands are denatured
in step 1 and then treated with bisulfite in step 2. Bisulfite treatment converts
un-methylated cytosine to uracil. Step 3 involves PCR amplification, which copies
the forward and reverse sequences and converts uracil into thymine. In step 4, paired-end
sequencing technology is used to sequence from the 5? ends. This gives a sequence
pair. Notice that one strand is enriched for Ts since the PCR and bisulfite treatment
converts unmethylated Cs to Ts. The other strand is enriched for As since the Crick
strandâ€™s un-methylated Cs become Ts, which are complemented with As. These As are
then paired with Gs in the opposing strand. Finally, step 5 recovers the original
sequence. When the T-enriched strand has a T and the A-enriched strand has a C, this
maps to a C, and when the T-enriched strand has a G and the A-enriched strand has
an A, this maps to a G. All other mismatches are due to errors in PCR or sequencing
since the two strands should match up perfectly as they come from opposing sequences
in the DNA.

Software installation and use

All software development and analysis was performed on the bioinformatics clusters
hosted by the Virginia Bioinformatics Institute and the Virginia Tech CS department.
Two of the machines consist of two quad core processors (Intel(R) Xeon(R) CPU E5-2407
0 @ 2.20 GHz) each with 128 gigabytes of memory. Two other machines consisted of three
eight core processors (Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20 GHz) each, where one
machine has 128 GB of memory and the other has 192 GB of memory. Bowtie2 (version
2.1.0) was compiled with the Gnu Compiler Collection. Bismark (version 0.7.12) was
run with Perl (version 5.10.1) with the non-directional flag and default settings.
Blastn (version 2.2.28+) was run with default settings on a C-to-T converted reference
genome to map unmapped Bismark reads, where all of the reads had Cs converted to Ts.
Bismark reports unique mappings, which are mappings that have a unique best score,
ambiguous mappings, which are reads with locations that have multiple best scores,
and unmapped reads, which had scores that were too low or reads where a location couldnâ€™t
be assigned.

File processing and statistical calculation

Custom Python 2.7 19] scripts using BioPython 20] were created to process files and calculate statistics such as sequence entropy and
sequence length. File manipulation included randomly sampling from FASTA files.

Sequence entropy was calculated for each sequence in each of the mapping categories
(unique, ambiguous, and unmapped). Entropy for a sequence, s, is calculated by taking
the negative sum of the frequency of each base, f_b, times the logarithm of the frequency of each base. More formally,

Entropy gives a measure of sequence randomness 21]. This measure of sequence complexity was chosen since it is simple and fast to compute.
A sequence of all Ts will have an entropy of 0, but a sequence with 25 percent of
each base will have an entropy of 2. We hypothesized that entropy would affect mapping
quality. Highly random-seeming sequences should uniquely map while non-random sequences,
for example, a sequence with mostly Ts, will map ambiguously or be discarded since
the string of Ts could map to many portions of the genome that have Cs at indeterminate
locations in the string of Ts. Bisulfite treatment tends to reduces entropy since
most Câ€™s will be converted to Tâ€™s since cytosine methylation is usually rare.

For comparing the unique mapping efficiency for converted and unconverted reads, 50
bootstrap replicates with replacement were created from all of the mouse data (both
differentiating and undifferentiated). Each replicate had 6.46 million reads. Each
replicate had the T-enriched pair and the A-enriched pair so that the untreated original
sequence could be recovered. Reads that aligned exactly (that is, there was no sequencing
error) were extracted and mapped with Bismark on default settings for both the T-enriched
and the A-enriched reads. About 68% (4.4 million) of the reads in each replicate aligned
exactly. The original read was recovered and mapped with Bowtie2 using the same settings
as Bismark. Bismark reports a uniquely mapped percent, and the recovered sequence
mapping categories from Bowtie2 was used to calculate a uniquely mapped percent in
the same way as Bismark. Read mappings were done for all 50 bootstrap replicates so
that a p-value could be computed.

For a single bootstrap replicate for the day 0 data, reads were extracted that had
0, 1, 2, and 3 mismatches somewhere in the read. Mapping using Bismark was done for
all of them. Bismark mappings for reads with mismatches only in the first 25 bases
were compared with reads with mismatches only in the latter right-most part. This
comparison was done for 1, 2, and 3 mismatches.

With Illumina pair-end sequencing technology, a total of 8 â€œlanesâ€ of data was produced.
Lane 8 data was used throughout except in the two cases mentioned above that used
bootstrap replicates drawn from all of the data. Both T-enriched and A-enriched strands
were used for analysis, and the day 0 and day 6 data was used. The day 0 lane 8 data
has 32,434,798 reads and the day 6 lane 8 data has 58,034,817 reads.

To understand what the best possible bisulfite mapping efficiency was, a simulation
was performed where simulated bisulfite reads were sampled from the mouse reference
genome at random and read sequencing error was introduced. This simulation represented
the best possible mapping scenario since the reads did not include natural variation.
The simulation sampled sequences of length 100, 75, and 50 bp in the same proportions
that were in the lane 8 data and compared mapping efficiency to the real hairpin data
with sequences of only 100, 75, and 50 bases. Simulations with sequencing error of
1% and 10% were created. (Sequencing error set to less than 1% had very little difference
in the mapping efficiencies of 1% error.) The program Sherman from Babraham Bioinformatics
was used to do the simulation with methylation rates set to CpG = 80% and CPH = 2%.
These rates were chosen since they were consistent with the literature 15]. Both BSMap and Bismark were used to do the mapping.

To investigate the benefit of using different algorithms, Blast was run on default
settings on 64 percent of the bisulfite-treated reads unmapped by Bismark for the
T-enriched day 0 lane 8 data. The reads and the reference genome had all Câ€™s converted
to Tâ€™s. Only 64 percent of the reads were used since Blast took a long time to complete
and produced a lot of data.