High-coverage sequencing and annotated assembly of the genome of the Australian dragon lizard Pogona vitticeps

The central bearded dragon, Pogona vitticeps, is widespread through the arid and semi-arid regions of eastern central Australia.
This lizard adapts readily to captivity, lays large clutches of eggs several times
per season, and is kept as a favoured pet species in Europe, Asia and North America.
The karyotype of P. vitticeps is typical of most Australian agamids, consisting of six pairs of macrochromosomes
and ten pairs of microchromosomes (2n?=?32) 1]. The sex determining mechanism is one of female heterogamety (ZZ/ZW) and the sex
chromosomes are a pair of microchromosomes 2]. Sex determination, a primary driver for our interest in generating this genome sequence,
is complex in this species, involving an interaction between the influences of incubation
environment and the ZZ/ZW genotype 3], 4].

Samples and sequencing

DNA samples were obtained from a blood sample taken from a single male Pogona vitticeps (Fabian, UCID 001003387339) verified as a ZZ male using sex-linked polymerase chain
reaction (PCR) markers 3] and cytological examination 2]. This work was undertaken in accordance with the Australian Capital Territory Animal
Welfare Act 1992 and the approval of the University of Canberra Animal Ethics Committee.
DNA was extracted and purified using standard protocols and transported to BGI-Shenzhen,
China for sequencing. 13 insert libraries were constructed with insert sizes of 250 bp,
500 bp, 800 bp, 2 kbp (x2), 5 kbp, 6 kbp, 10 kbp (x2), 20 kbp (x2) and 40 kbp (x2) and subjected to paired-end sequencing on an Illumina HiSeq 2000 platform to generate
317 Gbp of raw sequence (Table 1). After filtering for low-quality reads and duplicated reads arising from PCR amplification
during library construction, 146.38 Gbp of data were retained for genome assembly.
This amount of data represents an average read depth of 82.7 (Table 1), assuming a genome size of 1.81 pg, as estimated for a female P. vitticeps by flow cytometry 5]. This mass converts to a genome size of 1.77 Gbp 6].

Table 1. Summary of sequencing data derived from paired-end sequencing of 13 insert libraries
using an Illumina HiSeq 2000 platform

Reads from the short-insert libraries (250, 500 and 800 bp) were decomposed into short
sequences of length k (k-mers, with k?=?17) using Jellyfish version 1.1 7]. The histogram of k-mer copy number (Fig. 1) was strongly bi-modal, the first mode with a copy number that was half that of the
second, which reflects the high level of heterozygosity in this wild-caught lizard
(0.85 % of single nucleotide polymorphisms [SNPs] plus sequence insertions or deletions
[indels]). The second mode in the k-mer graph was used to obtain an estimate of the
genome size using the formula:

where L is the read length (125 bp), k is the k-mer length (17 bp), there are 98.35
Gbp of sequence data, and the mode is taken from the k-mer graph (48.5, Fig. 1). Our sequence-based estimate of 1.768 Gbp agrees well with the estimate of 1.77
Gbp for the ZW genome that was previously made using flow cytometry data 4].

Fig. 1. K-mer spectrum for the genome sequence of a male Pogona vitticeps (ZZ). Sequencing DNA derived from the short-insert libraries (250, 500, 800 bp) yielded
98.35 Gbases of clean data in the form of 125 bp reads, which generated 76.89×109
17-mer sequences. The solid line shows the k-mer spectrum (percentage frequency against
k-mer copy number). The second mode (copy number 48.5) represents homozygous single
copy sequence, whereas the first mode (24.5), half the copy number of the first, represents
heterozygous single copy sequence. Heterozygosity is high, which complicated assembly

Assembly

Assembly was performed with SOAPdenovo (version 2.03) 8], 9]. Briefly, the sequences derived from the short-insert libraries were decomposed into
k-mers to construct the de Bruijn graph, which was simplified to allow connection of the remaining k-mers into contiguous
sequence (contigs). We then aligned the paired-end reads from small and large-insert
library sets to the contigs, calculated the support for relationships between contigs,
assessed the consistent and conflicting relationships, and constructed scaffolds.
Finally, we retrieved paired reads that mapped to a unique contig but had the other
member of the pair located in a gap region. Reads falling in the same gap region were
then assembled locally. The final assembly (European Nucleotide Archive [ENA] accession
number ERZ094017) yielded a contig N50 of 31.3 kbp and a scaffold N50 of 2.3 Mbp (N50
meaning that 50 % of the genome sequence is contained in contigs, or scaffolds, equal
to or greater than this length), with unclosed gap regions representing only 3.78 %
of the assembly (Table 2).

Table 2. Statistics for the assembly contigs and scaffolds (after gap filling)

Reads from small-insert libraries that satisfied our filtering criteria were aligned
to the assembly using the Burrows-Wheeler Aligner (BWA, version 0.5.9-R16) 10], allowing for eight mismatches and one indel. Of the total number of reads (797.4 M),
96.4 % could be mapped back to the assembled genome and they covered 98.4 % of the
assembly excluding gaps. Bases in the assembled scaffolds had, on average, reads mapped
with 55X read depth. These data suggest that we have assembled most of the P. vitticeps genome. In addition, we used the CEGMA package (version 2.4) 11] to map 248 core eukaryotic genes to our P. vitticeps assembly. Our assembly captured 99 % (246 of 248) of the core CEGMA genes, with 93 %
(231) being complete. This is a higher assembly rate than that estimated for the green
anole Anolis carolinensis assembly (AnoCar2.0), which captured 93.6 % (232) of the core genes, with 85.9 %
(213) being complete.

Transcriptomes

We generated transcriptome data from the brain, heart, lung, liver, kidney, skeletal
muscle and gonads of male and female P. vitticeps (Table 3). None of the seven animals from which we collected RNA was used in generating the
genome sequence. Two sets of sequencing runs on two different male and female individuals
were performed by BGI-Shenzen, producing 309,436,077 90 bp paired-end reads (ENA accession
numbers ERR753524-ERR753530 and ERR413064-ERR413076). A third set of samples was sequenced
by The Ramaciotti Centre, University of New South Wales, Australia, including a sex-reversed
ZZ female, producing 89,687,526 101 bp paired-end reads (ENA accession numbers ERR413077-
ERR413082). We assembled these datasets (from all seven individuals) into 595,564
contigs using Trinity (release r2013_08_14) 12] with default parameters (ENA accession number ERZ097159). Only the first set of RNA-seq
reads was available for genome annotation (ENA accession numbers ERR753524-ERR753530)
but we make the entire dataset, including our de novo assembly, available with this article (see ‘Availability of supporting data’ section).

Table 3. Number of predicted genes with RNA-seq signals

Annotation

Transposable elements and other repetitive elements were identified using a combination
of homology, at both the DNA and protein levels, and de novo prediction. In the homology-based approach, we searched Repbase 13] for known transposable elements, used RepeatMasker 14] for DNA homology search against the Repbase database, and used WuBlastX to search
against the transposable element protein database provided within RepeatProteinMask
(bundled in RepeatMasker). In the de novo approach, we used RepeatModeler 15] and LTR_FINDER 16] to predict repetitive elements. Tandem repeats were identified using Tandem Repeats
Finder 17]. The relative success of the different approaches is shown in Table 4. Overall, we identified about 690 Mbp of repetitive sequences accounting for 39.47 %
of the genome, of which the predominant elements were long interspersed nuclear elements
(LINEs, which accounted for 33 % of repetitive sequences representing 12.2 % of the
genome) (Table 5).

Table 4. The statistics for repeats in the P. vitticeps genome annotated by different methods

Table 5. Breakdown of repeat content of the Pogona vitticeps genome derived from RepeatMasker analysis

We combined homology-based, de novo and transcriptome-based methods to predict gene content of the assembly. In the homology-based
prediction, the assembly was annotated by generating reference sets of A. carolinensis, Gallus gallus and Homo sapiens proteins, and aligning the reference sets to the assembly using TBLASTN (version
2.2.23; E-value???1?×?10
?5
). The resultant homologous genome sequences were then aligned against matching proteins
using Genewise (version wise2-2-0) 18] to define gene models. In the de novo prediction, we randomly selected 1,000 genes with intact open reading frames (ORFs)
as predicted by the homology-based approach to train the Augustus gene prediction
tool (version 2.5.5) 19] with the parameters appropriate to P. vitticeps. The de novo gene prediction was then performed with Augustus applied to the genome after repeat
sequences were masked as described above. In the transcriptome-based approach, we
mapped transcriptome reads to the assembly using TopHat (version 1.3.1) 20], which can align reads across splice junctions. These mapped reads were assembled
into transcripts using Cufflinks (version 1.3.0) 21] and then merged across samples (n?=?7, Table 3) into a single transcriptome annotation using the Cuffmerge option.

The results of the three approaches were combined into a non-redundant gene set of
19,406 protein-encoding genes, 63 % of which included intact ORFs (Table 6). Most of the predicted genes were supported by RNA-seq signals (Table 3).

Table 6. Characteristics of predicted protein-coding genes in the Pogona vitticeps assembly and comparison with Anolis carolensis, Gallus gallus and Homo sapiens

To assign gene names to each predicted protein-coding locus, we mapped the 19,406
genes to an Ensembl library collated from A. carolinensis, chicken G. gallus, human H. sapiens, western clawed frog Xenopus tropicalis and zebrafish Danio rerio. The name associated with the best hit for each P. vitticeps gene was assigned to each of 19,083 genes. Most of these genes (16,510) mapped to
a homolog even at high stringency (80 % of protein length aligned).

Bacterial artificial chromosome library

A large-insert genomic DNA bacterial artificial chromosome (BAC) library was constructed
from DNA from a wild-caught female dragon lizard (TC1542) confirmed to have the ZW
genotype using sex-linked PCR markers 3], 4] and cytologically 3]. The library is estimated to represent 6.3× of genome coverage, and is comprised
of 92,160 clones with an average insert size of 120 kbp. This resource is commercially
available through Amplicon Express (Pullman, WA, USA; http://ampliconexpress.com).

Anchoring sequences to chromosomes

Our previously published cytogenetic map of P. vitticeps consisted of 87 BACs that were mapped to the macrochromosomes (64 BACs) and microchromosomes
(23 BACs) 1]. We mapped an additional 80 BACs, extending the set to 125 markers on macrochromosomes
and 42 on microchromosomes. Sequence scaffolds were anchored to chromosomes by 52
loci, contained in the BACs, that are conserved in homologous syntenic blocks across
amniotes (A. carolinensis, G. gallus, H. sapiens). By using gene synteny information 37.9 % (670 Mbp) of the sequenced genome has
been assigned to chromosomes (Deakin et al., unpublished data).

Sex chromosome sequences

The sex of P. vitticeps is determined by a combination of chromosomal constitution and influence of environmental
temperature on the developing embryo. P. vitticeps has female heterogamety (with ZZ male and ZW female individuals), and the Z and W
chromosomes are among the ten pairs of microchromosomes 2]. Sex chromosome heteromorphy is evident by C-banding, but the degree of differentiation
of the Z and W chromosomes is slight 2]. The sex chromosomes of P. vitticeps are not homologous to the sex chromosomes of chicken (G. gallus) or other reptiles so far examined 22]. The ZZ genotype is reversed to a female phenotype at high incubation temperatures
3], 4].

Our laboratory has previously identified a sex-linked sequence using amplified fragment
length polymorphism screening and genome walking 4], 23]. Five contiguous BAC clones containing sex-linked markers that map to the sex chromosome
pair were sequenced to reveal 352 kbp of P. vitticeps sex chromosome sequence 24]. This region contained five protein-coding genes (oprd1, rcc1, znf91, znf131 and znf180) and several major families of repetitive sequences (long terminal repeat [LTR] and
non-LTR retrotransposons, including chicken repeat 1 [CR1] and bovine B LINEs [Bov-B
LINEs]) 1], 24].

More recently, we amplified micro-dissected W-chromosome fragments to yield many sex
chromosome sequence tags that were reciprocally mapped to their Z homologs (Matsubara
et al., unpublished data). All putative sex chromosome scaffolds were confirmed to
co-localize with the known ZW-BAC Pv3-L07 when physically mapped (Deakin et al., unpublished
data). In this way we identified 12.8 Mbp of the Z chromosome (on three scaffolds)
and increased the number of confirmed sex chromosome genes to 240 (Deakin et al.,
unpublished data).

GC content and isochore structure

We investigated patterns of GC content variation in the P. vitticeps genome using two approaches. First, we examined the absolute GC content in non-overlapping
5 kbp windows for several genomes (P. vitticeps, A. carolinensis25], Burmese python Python bivittatus26], king cobra Ophiophagus hannah27], western painted turtle Chrysemys picta28],Chinese softshell turtle Pelodiscus sinensis29], saltwater crocodile Crocodylus porosus30] chicken G. gallus, mouse Mus musculus, domestic dog Canis familiaris31] and western clawed frog X. tropicalis32]; Table 7; Fig. 2). We then examined variation in GC composition for these same genomes at increasing
spatial scales (5, 10, 20, 40, 80, 160 and 320 kbp windows; Fig. 3). We also looked at different subsets of the P. vitticeps genome, including macrochromosomes and microchromosomes, and the Z chromosome (Fig. 4a), by restricting the analysis to scaffolds that have been physically mapped (Deakin
et al., unpublished data).

Table 7. Comparison of mean GC content for available tetrapod genomes

Fig. 2. Distribution of GC content in 5 Kbp windows for a range of vertebrates including Pogona vitticeps

Fig. 3. Variation in GC content among windows for various genome sequences with increasing
window size (5, 10, 20, 40, 80, 160, and 320 Kb windows). The relationship for Pogona vitticeps is disaggregated to macrochromosomes, microchromosomes and the Z sex chromosome for
comparison. Scale of X axis is natural logarithm. Pogona macrochromosomes share the lack of isochore structure reported for the Anolis genome?

Fig. 4. Analysis of GC content in Pogona vitticeps. a, Distribution of GC content in all chromosomes, macrochromosomes, microchromosomes
and the Z chromosome, calculated with a non-overlap 5-kb sliding windows ; b, GC content of various components of the genome, in comparison with the average GC
content for macrochromosomes (red line), microchromosomes (green line) and the Z chromosome
(blue line) ; c, GC content of the macrochromosomes, microchromosomes and Z chromosomes broken down
for various components of the genome

The macrochromosomes of P. vitticeps are largely devoid of variation in GC content at small (5 kbp) spatial scales. In
fact, P. vitticeps macrochromosomes are more uniform in terms of GC distribution than is the A. carolinensis genome (standard deviation 0.029 versus 0.032 respectively; Table 7). With the exception of the Z microchromosome, P. vitticeps microchromosomes possess a heterogeneous distribution of GC-rich sequences over 5
kbp windows (Fig. 4a). In this regard, P. vitticeps microchromosomes resemble those of birds but differ markedly from those of A. carolinensis, whose GC content more closely resembles that of the macrochromosomes 33]. Intriguingly, the Z microchromosome of P. vitticeps has an average GC content comparable to that of coding sequences and short interspersed
nuclear elements (SINEs) (Fig. 4b), which suggests that this chromosome may be enriched in these GC-rich components
of the genome. However, with the exception of LTR transposable elements, all components
(CDS, introns, tandem repeats and transposable elements) showed greater GC content
if they resided on the Z chromosome than elsewhere (Z chromosome microchromosomes
macrochromosomes; Fig. 4c), suggesting that there are other, as yet unidentified, reasons for the observed
variation in GC content across different chromosome classes.

When variation in GC distribution is considered over larger spatial scales (tens to
hundreds of kbp, Fig. 3), the P. vitticeps macrochromosomes are similar to the A. carolinensis genome, which lacks substantial variation in GC composition, a striking departure
from isochore patterns seen in mammals and birds 33]. The Z chromosome, too, lacks substantial heterogeneity over larger spatial scales,
which perhaps reflects the uniform distribution of repetitive elements along its length
24]. Only the autosomal microchromosomes of P. vitticeps bear any similarity in GC distribution to the other sauropsid genomes examined. The
P. vitticeps genome, therefore, has compositional patterns distinct from that of A. carolinensis, which indicates that different processes have shaped the genomes of the two lizards
since they shared a common ancestor 144 million years ago.

Comparison with other assemblies

P. vitticeps and A. carolinensis had similar scaffold N50 values (2.29 Mbp and 4.03 Mbp, respectively). These values
for P. vitticeps are surprisingly good, as its genome was assembled from short read sequences, whereas
that of A. carolinensis was generated using Sanger sequencing. Our assembly compares well to nine other sauropsid
genomes, including those of two squamates, two turtles and three crocodilians (Table 8).

Table 8. Comparison of sequencing platform, assembler, and assembly statistics for the reptiles
for which a genome sequence is available

The gene parameters listed in Table 6 compare well to those of other vertebrates (see also Fig. 5).

Fig. 5. Comparisons of gene parameters among Pogona vitticeps, Gallus gallus, Python bivittatus, Anolis carolinensis, and Pelodiscus sinensis genomes

Concluding remarks

The quality of the P. vitticeps assembly is comparable to that of other published squamate genomes. This genome assembly,
coupled with the availability of a BAC library and the development of a high-density
physical map for each chromosome, provides an unparalleled resource for accelerating
research on sex determination, major histocompatibility complex evolution, and the
evolution of adaptive traits in squamates to complement the advances brought about
by the sequencing of the A. carolinensis genome 25].