HIV-1 and HIV-2 exhibit similar mutation frequencies and spectra in the absence of G-to-A hypermutation

Plasmids, cell lines, and reagents

The HIV-1 vector, pNL4-3 HIG, has been previously described 67]. This vector contains a cassette encoding heat stable antigen (HSA), an internal
ribosomal entry site (IRES), and enhanced green fluorescent protein (EGFP). The HIV-2
vector, pROD HIG, was created from pHIV-2 H0G 68], a kind gift from Dr. Wei-Shau Hu (HIV Drug Resistance Program, Frederick National
Laboratory for Cancer Research, Frederick, MD, USA). The pHIV-2 H0G vector does not
express Vpr or EGFP due to multiple point mutations. In order to construct pROD HIG,
the vpr and egfp genes were restored by site-directed mutagenesis using the QuikChange II XL kit (Agilent
Technologies, Inc.; Santa Clara, CA, USA), and the resulting vector was verified by
DNA sequencing. Both HIV-1 and HIV-2 vectors contain intact open reading frames for
all genes except for the env and nef genes. Vector viral stocks from pNL4-3 HIG were produced by co-transfecting with
pNL4-3 Env, a kind gift from Dr. Eric Freed (HIV Drug Resistance Program, Frederick
National Laboratory for Cancer Research, Frederick, MD, USA). Vector viral stocks
of pROD HIG were produced by co-transfecting pROD10-Env 69], a kind gift from Dr. Paula Cannon (University of Southern California, Los Angeles,
CA, USA). The human embryonic kidney (HEK 293T) cells were purchased from American
Type Culture Collection (Manassas, VA, USA) and maintained in Dulbecco’s Modified
Eagle’s Medium (DMEM) from Cellgro (Manassas, VA, USA) with 10% HyClone FetalClone
III (FC3) from Thermo Scientific (Waltham, MA, USA) and 1% penicillin/streptomycin
from Life Technologies (Grand Island, NY, USA). U373-MAGI-CXCR4
CEM
cells were obtained from Dr. Michael Emerman through the NIH AIDS Reagent Program,
Division of AIDS, NIAID, NIH 70]. U373-MAGI cells were maintained similarly to 293T cells but with addition of 1.0 ?g/mL
puromycin, 0.1 mg/mL hygromycin B, and 0.2 mg/mL G418 to the medium. For transfections,
poly-L-lysine was from Newcomer Supply (Middleton, WI, USA) and polyethylenimine (PEI) was
from Polysciences, Inc. (Warrington, PA, USA).

Virus production and titering

Virus was produced by co-transfecting pNL4-3 HIG or pROD HIG with pNL4-3 Env or pROD10
Env, respectively, into 293T cells via the PEI method 71]. PEI stocks were prepared at 1 mg/mL by dissolving PEI in water, adjusting the pH
to 7.0, and filtering through a 0.2 µm filter. 24 h before transfection, 4 million
293T cells/plate were seeded onto 10 cm plates pre-coated for 5 min with poly-L-lysine. For each plate, 10 µg of pNL4-3 HIG or pROD HIG + 5 µg Env expression plasmid + 45 µL
1 mg/mL PEI were combined with serum-free DMEM to a final volume of 1 mL. After 20 min
of incubation, the medium on the 293T cells was replaced and the DNA-PEI mixture was
added. The medium was replaced 16 h post-transfection, and viral stocks were collected
48 h post-transfection by filtering the supernatants through a 0.2 µm filter. For
each viral stock, five plates were transfected, and the resulting supernatants were
pooled and concentrated (~tenfold) using 100,000 MWCO filtration columns (Vivaproducts;
Littleton, MA, USA). Viral stocks were then treated with 10 U/mL of DNase I (New England
Biolabs; Ipswich, MA, USA) for 2 h at 37°C to degrade residual plasmid DNA from transfections.
Viral stocks were then divided into 1.0 mL aliquots and frozen at ?80°C.

Prior to large-scale infections, viral stocks were first titered in U373-MAGI cells
based on EGFP expression. The day before infection, 31,250 cells/well were plated
in 24-well plates. After 24 h, the media was replaced and varying volumes of virus
ranging from 15.625 to 500 µL (twofold dilution series) were added. To improve infectivity,
the cells were infected by spinoculation (1,200 × g for 2 h at 24°C). The media was
replaced again 24 h post-infection and cells were collected at 72 h post-infection
for analysis of EGFP expression by flow cytometry. The cells were treated with trypsin,
transferred to 96-well plates, pelleted at 500 × g for 5 min, and resuspended in 200 µL
Dulbecco’s phosphate-buffered saline (DPBS) + 2% FC3/well. EGFP expression from at
least 10,000 gated cells was analyzed using a BD LSR II flow cytometer (BD Biosciences;
San Jose, CA, USA). EGFP was excited with a blue 488-nm laser and emission detected
using 505LP and 525/50 filters. Virus titers (expressed as infectious units/mL) were
calculated based upon EGFP expression at low infectivities (40%) as previously described
72].

Infections for Illumina sequencing

In order to prepare samples for Illumina sequencing, 1 × 10
6
U373-MAGI cells were infected at a multiplicity of infection (MOI) of 1.0. These infections
were performed in 24-well plates (31,250 cells/well) in order to avoid any potential
effect of the plate format on infectious titer. Uninfected cells and cells infected
with heat-inactivated viruses (i.e. virus stocks that were incubated at 65°C for 1 h)
were included as negative controls. The cells were infected by spinoculation, and
the medium was replaced 24 h post-infection. Cells were collected for genomic DNA
extraction at 72 h post-infection by treating with trypsin, pelleting, and washing
three times with DPBS to further reduce plasmid carryover. Extra wells of infected
cells were analyzed by flow cytometry to verify infectivity. In this assay, all proviruses
result from a single cycle of infection, as neither producer cells nor target cells
can be re-infected, due to a lack of the appropriate receptor and co-receptor or to
a lack of envelope expression, respectively (Figure 1).

Genomic DNA extraction and quantification of plasmid carryover

Genomic DNA was extracted from all collected cells using the High Pure PCR Template
Preparation Kit (Roche; Basel, Switzerland) following the manufacturer’s instructions
and eluted in 150 ?L buffer. Genomic DNA was treated with DpnI for 1 h at 37°C to further reduce plasmid carryover from the transfections, after
which DpnI was heat-inactivated at 80°C for 20 min. In order to quantify any residual plasmid
carryover, two approaches based on quantitative PCR (qPCR) were adopted: (1) the ampicillin
resistance gene copy number was determined and divided by the proviral copy number
(as measured using the HSA Illumina amplicon), or (2) the proviral copy number from
heat-inactivated virus infections was determined and divided by the proviral copy
number from the corresponding un-treated infections. For both approaches, qPCR was
performed using 4 ?L water, 6.25 ?L 2× Power SYBR Green Master Mix (Life Technologies),
0.625 ?L each primer (500 nM final concentration), and 1 ?L template. The cycling
conditions used were an initial denaturation of 95°C 10 m, 40 cycles of 95°C 15 s/55°C
15 s/72°C 30 s, and a final extension of 72°C 7 min. The sequences of the primers
to the ampicillin resistance gene were 5?-ACTTTATCCGCCTCCATCCAGTC-3? and 5?-GAGCGTGACACCACGATGC-3?.
Absolute standard curve series (from 10
1
to 10
6
copies/?L) were constructed by quantifying the pNL4-3 HIG and pROD HIG plasmids with
the Qubit dsDNA HS Assay Kit (Life Technologies). We found that the level of plasmid
carryover for HIV-1 was 0.2% when measured by either method, while the level of carryover
was 2.8 or 1.4% for HIV-2, depending on the approach used (Additional file 1: Table S1). These results are comparable to those obtained in a similar study of
HIV-1 mutagenesis 5] and are too low to significantly affect measured mutation frequencies.

Amplifications for Illumina sequencing

PCR was performed on five small (~150–170 bp) amplicons (Gag, Vif, HSA, EGFP-1, EGFP-2)
for each sample, with HIV-1 and 2 amplicons positioned in homologous locations. All
forward primers contained 5-base barcodes (differing by at least two bases) to allow
demultiplexing of pooled PCR products. All primer and barcode sequences are listed
in Additional file 11: Table S6. Barcodes were randomly generated using a program written by Luca Comai
and Tyson Howell (http://comailab.genomecenter.ucdavis.edu/index.php/Barcode_generator) 73]. PCR was performed using the Phusion Hot Start II High-Fidelity DNA Polymerase (Fisher
Scientific; Pittsburgh, PA, USA). PCR reactions were performed with 8 µL of genomic
DNA (~50,000 target copies), 500 nM of each primer, and a final volume of 50 µL/reaction.
The cycling conditions used were an initial denaturation of 98°C 30 s, 30 cycles of
98°C 10 s/56°C 30 s/72°C 15 s, and a final extension of 72°C 10 min. All PCR reactions
were performed in triplicate to reduce the risk of clonal amplification, and the products
were pooled after amplification. Negative control reactions were performed lacking
template or with genomic DNA from uninfected cells.

In order to assess the degree of background error due to PCR and Illumina sequencing,
control amplifications were performed from pNL4-3 HIG or pROD HIG plasmids using the
same cycling conditions. For these reactions, 50,000 copies of plasmid per 50 µL reaction
was used, a target level found via qPCR to be similar to that of the other samples.
Genomic DNA (8 µL/reaction) from uninfected cells was added to the plasmid PCR reactions
to account for any potential impact of genomic DNA on amplification. Further, the
degree of intra-sample PCR-mediated recombination was investigated in the plasmid
controls, as recombination could affect the association between mutations and thus
alter hypermutant frequencies. To determine recombination frequencies, EGFP-1 was
amplified from a 1:1 mixture (i.e. 25,000 copies each) of wild-type and mutant EGFP-1
sequences. The mutant EGFP-1 sequence contained two genetic markers positioned ~50 bp
apart, and each marker consisted of two mutations to facilitate distinction from PCR
and sequencing-induced errors. Intra-sample recombinants were detected by identifying
read pairs from the plasmid controls with a single marker set, rather than the expected
zero or two marker sets present in the wild-type and mutant EGFP-1 sequences, respectively.
Importantly, all mutations utilized as recombination markers were masked before mutational
analysis.

Library preparation and Illumina sequencing

Amplicons were gel-purified using the Promega SV Gel Extraction Kit (Promega Corp.;
Madison, WI, USA). After gel purification, all amplicons were quantified using the
Qubit dsDNA HS Assay Kit (Life Technologies) and a Qubit 2.0 fluorometer. For each
sample, all amplicons were pooled together in an equimolar fashion to normalize coverage
between amplicons. Next, 100 ng of each sample (12 samples in total) were submitted
to the University of Minnesota Genomics Center for library preparation. The libraries
were constructed with the TruSeq Nano DNA Sample Preparation Kit following the manufacturer’s
instructions but using AMPure XP beads (Beckman Coulter, Inc.; Indianapolis, IN, USA)
at a bead to sample ratio of 1.8. The 15 libraries were again quantified using the
Qubit dsDNA BR Assay Kit, size-confirmed with Agilent DNA 1000 chips (Agilent Technologies;
Santa Clara, CA, USA), and pooled in an equimolar fashion to normalize coverage between
libraries. Samples were pooled after library preparation, rather than before, because
it had been previously determined that significant inter-sample recombination can
occur during the 10 cycles of PCR typically utilized in library construction (data
not shown). In order to improve sequence diversity and quality, a PhiX library was
added in at ~25% of total mass. Next, 2 ?L of the 10 nM library pool was denatured
and diluted to a final concentration of 4 pM for DNA sequencing. Sequencing of proviral
DNA was conducted by using the Illumina MiSeq with 2 × 150 paired-end sequencing.
All Illumina sequencing data supporting the results of this manuscript have been deposited
into the NCBI Sequence Read Archive (SRA) under accession code BioProject PRJNA287455.

Processing of proviral DNA sequencing data

First, paired-end reads from the Illumina MiSeq reaction were demultiplexed based
on perfect barcode matches, and barcode sequences were trimmed off during the process.
Second, poor quality reads were filtered out using quality criteria found to reduce
Illumina background error rates 35]. Specifically, read pairs were discarded in which either read contained a B-tail
(i.e. one or more low quality bases at the end of the read), contained at least one
uncalled base, had less than two-thirds of bases with Q-score ?30 in the first half
of the read, or had an average Q-score 30 in the first 30% of the read. Third, reads
from each of the 12 samples were mapped to the appropriate reference sequence (pNL4-3
HIG or pROD HIG) with GSNAP 74] using default parameters. Finally, a small number of read pairs (~35,000) that were
aligned either partially or fully outside of the appropriate amplicon regions were
excluded. For each sample, the numbers of initial read pairs, read pairs lost during
mapping or filtering, and final read pairs are listed in Additional file 2: Table S2. We obtained ~4.7 million total read pairs after all processing steps,
which removed primarily PhiX read pairs or HIV read pairs with imperfect barcodes
or poor quality. About 319,000–461,000 read pairs were obtained per sample (average
of 395,000 read pairs/sample), or 46,000–111,000 read pairs per amplicon per sample
(average of 79,000 read pairs/amplicon/sample).

In order to identify mutations present in read pairs passing the above processing
steps, a custom algorithm was developed to compile mutation frequency data for each
sample, built using the Genome Analysis Toolkit (GATK) walker framework 75]. This algorithm determines both the frequency of total mutations as well as of specific
mutational types (substitutions, transitions, transversions, every type of transition
or transversion, insertions, deletions, etc.). In order to minimize background error
rates, mutations were required to be identified on both sequences in a read pair,
which was possible because forward and reverse reads were mostly overlapping due to
small amplicon sizes (~150–170 bp). Furthermore, substitutions and insertions were
only counted if they had a Q-score ?30 for the relevant base(s) on both reads. Wild-type
bases had to meet the same criteria as mutations (i.e. identified as wild-type and
Q-score ?30 on both sequences of a read pair). Non-overlapping segments of read pairs
as well as single reads were excluded from mutational analyses. All primer sequences
were also excluded from mutational analysis, as errors within primers would not represent
biologically meaningful mutations. We also examined the distribution of all background
errors (from PCR and sequencing) in the plasmid controls and identified numerous plasmid
error hotspots (defined as upper outliers within the distribution of frequencies of
individual mutations based on the 1.5 × IQR rule). Most plasmid error hotspots (~83%)
were G-to-T or C-to-A transversions. Within identical amplicons (HSA/EGFP-1/EGFP-2),
many common plasmid error hotspots were identified in the HIV-1 and HIV-2 plasmid
controls (~88% overlap), whereas the degree of overlap was much lower for the ~60%
homologous viral amplicons (~49% overlap in Gag and ~31% overlap in Vif). Plasmid
error hotspots that were shared between the HIV-1 and HIV-2 plasmid controls (Additional
file 10: Table S5) were masked before further downstream analysis, as the presence of these
mutations in the biological samples would most likely represent background errors.
Rather than masking all mutational types at error hotspots, only the problematic type(s)
(e.g. G-to-T) were masked at the indicated positions. Insertions and deletions were
scored as single events regardless of the number of bases inserted or deleted. Mutation
frequencies (defined as mutations/bp) were calculated by dividing the number of mutations
passing filters by all reference bases (mutations + wild-type bases) passing filters.

In addition to examining mutation frequencies and spectra, hypermutants were identified
and characterized within the Illumina sequencing data. Using a custom GATK walker,
hypermutant counts for each type of transition were collected (G-to-A, A-to-G, T-to-C,
and C-to-T). Hypermutants were defined as individual read pairs containing two or
more of the same type of transition. All transitions had to be identified on both
sequences with Q-scores ?30, and a single read pair could theoretically count as two
different types of hypermutants if it contained multiple instances of two different
transition types. Hypermutant frequencies were calculated by dividing the number of
hypermutant read pairs by all (hypermutant + non-hypermutant) read pairs that passed
the processing steps. In order to examine G-to-A hypermutation hotspots, ranked lists
of G-to-A mutation frequencies were generated at individual bases within G-to-A hypermutants.
Hypermutation hotspots were then defined as statistical upper outliers within the
distribution of frequencies using the 1.5 × IQR rule.

Biostatistical analysis of Illumina DNA sequencing data

To test for factors that may influence mutation frequencies, generalized linear mixed
effects models were applied to the data that came from our Illumina data processing
pipeline. The raw counts for each type of mutation (e.g. transitions) were modeled
as Poisson random variables with an offset given by the total number of reference
bases. The type of sample (i.e. HIV-1, HIV-2, and plasmid controls for HIV-1 and HIV-2),
the type of amplicon, and their interactions were treated as fixed effects while the
replicate was treated as a random effect. The logarithmic link was used, as is standard
for Poisson outcomes, and penalized quasilikelihood was used to estimate the model
parameters 76]. These computations were conducted using R v 3.1.0 and the MASS package. All figures
and tables were created in GraphPad Prism v 6.0 (GraphPad Software, Inc.; La Jolla,
CA, USA) or Microsoft Office for Mac 2011 v 14.3.8 (Microsoft Corp.; Redmond, WA,
USA).