Identification and characterization of expressed retrotransposons in the genome of the Paracoccidioides species complex


Clustering, similarity and functional annotation of Paracoccidioides EST sequences

As shown in the workflow in Figure 1, the first step in the identification of active retrotransposons in the genomes of
the Paracoccidioides species complex was to access the information contained in the EST database (http://www.ncbi.nlm.nih.gov/genbank/dbest/). A local database was built with 41,558 downloaded Paracoccidioides ESTs, which were then clustered, resulting in 12,922 sequence clusters distributed
in 4,812 contigs and 8,110 singlets. The EST clusters were compared with sequences
deposited in three databases: NR (the NCBI non-redundant protein database), TEfam
and Repbase (Figure 1). Approximately 20% of the clusters (2,544/12,922) showed no similarity with any
sequences in the NR database with the parameters set (see Methods section). Blast
hit descriptions were analyzed using a lexical search with specific keywords,
resulting in a first set of 142 EST sequences of putative retrotransposons (Figure 1). Further searches against specific databases identified 809 EST sequences common
to the TEfam and Repbase databases (6.2% of all clusters) (Figure 1). A set of 52 EST sequences common to the results of the
similarity and lexical searches was mapped to the Paracoccidioides genomes.
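
The filtering logic described above (a keyword search over Blast hit descriptions, followed by an intersection with the clusters that hit both TE databases) can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the study: the keyword list, the function names and the cluster identifiers are hypothetical, and the actual keywords and search parameters are those given in the Methods section.

```python
import re

# Hypothetical keywords for illustration only; the study's list is in its Methods.
KEYWORDS = ("retrotransposon", "reverse transcriptase", "integrase", "gag", "polyprotein")


def lexical_hits(descriptions, keywords=KEYWORDS):
    """Return IDs of clusters whose Blast hit description mentions any keyword.

    `descriptions` maps a cluster ID to the description line of its best hit.
    """
    pattern = re.compile(
        "|".join(r"\b" + re.escape(word) + r"\b" for word in keywords),
        re.IGNORECASE,
    )
    return {cid for cid, text in descriptions.items() if pattern.search(text)}


def shared_candidates(lexical, tefam, repbase):
    """Keep clusters found by the lexical search and by both TE databases."""
    return lexical & (tefam & repbase)


if __name__ == "__main__":
    # Toy input with made-up cluster IDs and descriptions.
    hits = {
        "cluster_1": "Gag-Pol polyprotein [Aspergillus nidulans]",
        "cluster_2": "hypothetical protein",
        "cluster_3": "putative reverse transcriptase",
    }
    lexical = lexical_hits(hits)
    tefam_hits = {"cluster_1", "cluster_2"}
    repbase_hits = {"cluster_1", "cluster_3"}
    print(sorted(shared_candidates(lexical, tefam_hits, repbase_hits)))
```

Matching on whole words keeps a short keyword such as "gag" from firing inside unrelated words, which matters when the hit descriptions are free text.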

Figure 1. Consolidated workflow for the identification of Paracoccidioides EST sequences harboring
retrotransposons. All EST sequences (41,558) from Paracoccidioides species available for
download at dbEST (release 110111) were used to build a local database. The clustering
approach adopted (reciprocal BLAST and CAP3) resulted in 12,922 clusters, which were
used for further similarity searches against different databases (NR of proteins from
NCBI, Repbase and TEfam). Blast hit descriptions were analyzed based on a lexical
search approach (for details see Methods), and the transposon matching rules suggested
by Wicker and coworkers [2] for classifying eukaryotic TEs were adopted. A set of 52 EST sequences was found
to be common between the results from similarity and lexical searches.

Identification and characterization of Paracoccidioides retrotransposons

Employing the set of 52 EST clusters as queries against the Paracoccidioides genomes, five genomic sequences with characteristics of retrotransposons were identified
and are referred to henceforth as RtPc (Retrotransposon Paracoccidioides complex) 1 to 5 (Table 1). All RtPc elements showed similarity with retrotransposons (Class I) (Table 1). Twenty-seven EST clusters were anchored to these five genomic sequences, each corresponding
to a putative retrotransposon, distributed as follows: 18 clusters mapped to RtPc1,
three each to RtPc3 and RtPc5, two to RtPc4 and only one to RtPc2 (Additional file 1). The remaining 25 EST groups were mapped to genomic sequences with a large number
of stop codons, preventing identification of complete copies of retrotransposons.

Table 1. Classification of retrotransposons identified in Paracoccidioides genomes

Similarity searches were conducted to locate and perform the functional and structural
annotation of intact, full-copy elements in all supercontigs of the three sequenced Paracoccidioides genomes: P. brasiliensis (isolates Pb18 and Pb03) and P. lutzii (isolate Pb01) (http://www.broadinstitute.org) [21]. Consensus sequences were generated by alignment of all intact copies of each retrotransposon
(Additional file 2). We found at least one intact copy of each RtPc element in one or
more of the sequenced Paracoccidioides genomes. The elements were classified (Table 1) based on the results of alignments with sequences of retrotransposons from the GIRI
database (http://www.girinst.org/censor/index.php) using the criteria proposed by Wicker et al. [2] and Kapitonov et al. [24] (http://www.girinst.org/RTphylogeny/RTclass1/). We identified four RtPc elements belonging to the LTR order, from the Gypsy (LTR-Gypsy-RtPc1 and LTR-Gypsy-RtPc2) and Copia (LTR-Copia-RtPc3 and LTR-Copia-RtPc4) superfamilies, and one LINE retrotransposon (LINE-Tad-like-RtPc5)
(Table 1, Figure 2) that we initially identified as a new type of fungal non-LTR retrotransposon related
to the Tad clade.

Figure 2. Structure and organization of RtPc elements in the Paracoccidioides complex. Schematic representations of complete RtPc elements are shown. The LTRs
are represented by blue arrowheads and PBS/PPT by black bars. The domains are represented
as follows: zinc finger – light purple; protease (PR) – green; reverse transcriptase
(RT) – yellow; RNase H (RH) – red; integrase (IN) – blue; chromodomain (CH) – orange;
and endonuclease (EN) – pink. The ORF with its respective size is represented above
each element. Figures are not to scale.

LTR retrotransposons of the superfamily Gypsy (LTR-Gypsy-RtPc)

The Gypsy elements identified in the Paracoccidioides complex were 5412 to 5740 bp in length, with the coding regions flanked by LTRs.
A primer binding site (PBS) and a polypurine tract (PPT) were also identified at the 5′ and 3′ ends of each
element, respectively (Additional file 3, Figure 2). The element LTR-Gypsy-RtPc1 (5740 bp) has two ORFs: one corresponding to the gag gene, which is predicted to
encode the proteins of the virus-like particle (VLP), and one corresponding to the pol gene, which encodes
a polyprotein that gives rise to RNase H-reverse transcriptase, integrase and protease
after processing. The pol ORF was in a −1 frameshift in relation to the gag ORF. Coding
sequences were flanked by 246-bp LTRs beginning with the nucleotides TG and ending with CA.
The first ORF (gag gene) encodes a 396-amino-acid (aa) protein with a conserved zinc
finger domain (17 aa), and the second encodes a polyprotein that displays the following
conserved domains: protease (99 aa), reverse transcriptase (176 aa), RNase H (126
aa), integrase (114 aa) and a chromodomain (52 aa). Full copies of LTR-Gypsy-RtPc2 (5412 bp) showed a single ORF that encodes a polyprotein with POL domains consisting
of protease (96 aa), reverse transcriptase (178 aa), RNase H (123 aa) and integrase
(113 aa). LTR-Gypsy-RtPc1 and LTR-Gypsy-RtPc2 shared 64.9% and 64.4% nucleotide sequence similarity, respectively, with retroelements
identified in the fungus Aspergillus nidulans (Table 1).

LTR retrotransposons from the Copia (LTR-Copia-RtPc) superfamily

The Copia elements identified in the Paracoccidioides complex were 5181 to 5685 bp long with the coding regions flanked by LTRs. A primer
binding site (PBS) and a polypurine tract (PPT) were also identified at the 5′ and 3′ ends of each element,
respectively (Additional file 3, Figure 2). Comparison with conserved domain databases showed that the element LTR-Copia-RtPc3 has a single ORF that encodes a 1528-aa protein with three conserved domains:
integrase (126 aa), reverse transcriptase (253 aa) and RNase H (143 aa). The LTR-Copia-RtPc4 element had a single ORF with fused gag and pol sequences predicted to encode
a 606-aa polyprotein with the following domains: Gag (87 aa), integrase (114 aa),
reverse transcriptase (245 aa) and RNase H (160 aa). The elements LTR-Copia-RtPc3 and LTR-Copia-RtPc4 shared 68.6% and 62.6% nucleotide sequence similarity with retrotransposons
identified in Drosophila bipectinata and in the fungus Coccidioides posadasii, respectively (Table 1).

A Non-LTR retrotransposon similar to the Tad clade elements (LINE-Tad-RtPc5)

The complete element LINE-Tad-like-RtPc5 (5905 bp) contains two separate ORFs: the first encodes a protein (572
aa) with no similarity to any known protein, and the second, with a frameshift in
relation to the first one, encodes a protein (1312 aa) that has conserved endonuclease
(222 aa), reverse transcriptase (267 aa) and RNase H (145 aa) domains. This element
has no LTRs, but a region corresponding to a poly A tail was identified (Additional
file 3, Figure 2). LINE-Tad-RtPc5 shared 63.9% nucleotide sequence similarity with a retrotransposon identified
in the fungus Blumeria graminis (Table 1). To assign this element to a specific clade, we used an automated tool (RTclass1, http://www.girinst.org/RTphylogeny/RTclass1/) [24] that performs a phylogenetic analysis of the RT domain protein sequence. Based on the
results from the RTclass1 tool, our LINE element clustered together with the Tad1 clade, although the RtPc5 reverse transcriptase was indicated as belonging to an
outgroup clade. To place this element more precisely, we consulted a survey of fungal non-LTR retrotransposons [25]
in which the authors employed an in silico approach to screen 57 fungal genomes, reporting more
than 100 novel non-LTR retrotransposons and, importantly, describing two new clades,
Inkcap and Deceiver. P. brasiliensis isolate Pb01 (now P. lutzii) was among the species searched by the authors, and three novel Tad-like elements, named PbNLR1, PbNLR2 and PbNLR3, were reported in
this genome. Based on the phylogeny of the RT domains,
the authors classified these elements in the Tad clade under distinct families: CgT, Ask1 and Ask2. On the basis of sequence identity, we established that
the LINE-Tad-RtPc5 described here corresponds to the PbNLR1 element reported in [25], which is a Tad-CgT element.

Distribution of retrotransposons in the Paracoccidioides species complex

After identifying at least one complete copy of each of the five elements, the RtPc
sequences were used to identify intact and truncated copies in the sequenced Paracoccidioides genomes (Table 2). Overall, 538 copies of RtPc elements were found scattered throughout the genomes
of the three isolates. The majority of RtPc elements (54.46%) were identified in P. lutzii (Pb01), and the remaining 245 copies were found in P. brasiliensis isolates (17.28% and 28.26% in Pb03 and Pb18, respectively) (Table 2). The distribution of copies of each element in the sequence supercontigs is shown
in Additional file 4. Out of 538 retroelements identified in the Paracoccidioides genomes, 514 (95.54%) were truncated and only 24 (4.46%) were intact (Table 2). The distribution of truncated forms in Paracoccidioides species was as follows: 284 copies in P. lutzii (Pb01) and the remaining copies in P. brasiliensis isolates (141 copies in Pb18 and 89 in Pb03) (Table 2). Most of the intact elements were found in the Pb18 P. brasiliensis isolate (11/24) and Pb01 P. lutzii isolate (9/24), and four copies were found in the Pb03 P. brasiliensis isolate (Table 2). Gypsy elements were the most abundant retrotransposons in Paracoccidioides genomes, comprising approximately 53.34% (287/538) of the total retroelements identified
in these species. Out of 111 copies of LTR-Gypsy-RtPc1, 57 were found in P. lutzii, followed by 39 and 15 in isolates Pb18 and Pb03, respectively. Four intact copies
of RtPc1 were present in P. lutzii, and eight were present in Pb18 (Table 2). LTR-Gypsy-RtPc2 was the most abundant of all
the retrotransposons identified in the Paracoccidioides genomes studied here (176/538). It is interesting to note that 56.81% of LTR-Gypsy-RtPc2 copies (100/176) were found in P. lutzii (isolate Pb01) and that all of these were truncated. Only one intact copy of this
element was found in the genomes of P. brasiliensis isolates (Table 2).

Table 2. Distribution of retrotransposons in Paracoccidioides species genomes

Copia and LINE retrotransposons comprised 22.5% and 24.16% of all the retroelements in
the Paracoccidioides genomes, respectively. Out of eighteen LTR-Copia-RtPc3 copies, ten were found in the isolate Pb03, followed by five in Pb01 and three
in Pb18. As for the LTR-Copia-RtPc4 element, most copies were found in P. lutzii (n = 66), followed by P. brasiliensis isolates Pb03 (n = 28) and Pb18 (n = 9). For the LINE-Tad-RtPc5 element, again, most copies (n = 65) were identified in P. lutzii (Pb01), followed by isolates Pb03 (n = 34) and Pb18 (n = 31). Most of these elements were
truncated in the isolate Pb01 (92.3%) (Table 2).
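
The proportions quoted in this section can be re-derived from the copy counts given in the text, which is a useful consistency check when counts and percentages are reported side by side. The sketch below uses only numbers stated above; the variable and function names are illustrative, and Table 2 itself is not reproduced.

```python
# Copy counts quoted in the text (truncated and intact copies per isolate,
# and total copies per element).
truncated = {"Pb01": 284, "Pb18": 141, "Pb03": 89}
intact = {"Pb01": 9, "Pb18": 11, "Pb03": 4}
element_totals = {
    "LTR-Gypsy-RtPc1": 111,
    "LTR-Gypsy-RtPc2": 176,
    "LTR-Copia-RtPc3": 18,
    "LTR-Copia-RtPc4": 66 + 28 + 9,
    "LINE-Tad-RtPc5": 65 + 34 + 31,
}


def share(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


total = sum(truncated.values()) + sum(intact.values())
# The per-isolate and per-element tallies should describe the same copies.
assert total == sum(element_totals.values())

per_isolate = {iso: share(truncated[iso] + intact[iso], total) for iso in truncated}

per_superfamily = {}
for name, count in element_totals.items():
    superfamily = name.split("-")[1]  # "Gypsy", "Copia" or "Tad"
    per_superfamily[superfamily] = per_superfamily.get(superfamily, 0.0) + share(count, total)

if __name__ == "__main__":
    print(f"total copies: {total}")
    for label, value in {**per_isolate, **per_superfamily}.items():
        print(f"{label}: {value:.2f}%")
```

The computed values agree with those in the text to within about 0.01 percentage points; the small differences come from rounding versus truncating the second decimal place.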

LTRs not associated with complete retrotransposons

In addition to intact and truncated elements, structural variations of LTR retrotransposons
include solo LTRs, which together with LTR remnants are believed to be the result
of unequal and illegitimate recombination. We identified 468 copies
of solo LTRs closely related to Gypsy- and Copia-RtPc elements; most of these were found in P. lutzii (n = 222), followed by P. brasiliensis isolates Pb18 (n = 164) and Pb03 (n = 81) (Additional file 4). Solo LTRs belonging to the Gypsy superfamily were far more abundant (2.6-fold) than those of Copia-like retrotransposons. The ratio of solo LTR sequences to intact elements was 13.4
in Paracoccidioides genomes, and the ratio of solo-Gypsy LTRs to intact elements was also higher than that for solo-Copia LTRs vs. intact elements (Additional file 4).

The presence of RtPc elements in Paracoccidioides isolates of distinct phylogenetic origins

To investigate the occurrence of RtPc elements, a segment of the coding sequence for
reverse transcriptase was PCR amplified from the genomic DNA of 31 isolates, including
Pb01-like P. lutzii isolates and isolates belonging to the Paracoccidioides phylogenetic lineages S1, PS2 and PS3 from the P. brasiliensis complex (Table 3, Figure 3). Reverse transcriptase was present in 24 isolates. The identity of the amplicons
was confirmed by sequencing a 300-bp fragment corresponding to the coding region for
the reverse transcriptase of each of the five elements (isolates Pb01, Pb03 and Pb18)
(data not shown). The element LINE-Tad-RtPc5 was present in all isolates. No Gypsy
element was found in isolates EPM81 and EPM102. No correlation was found between the
distribution patterns of RtPc elements and the phylogeny of Paracoccidioides lineages.

Table 3. Paracoccidioides isolates used in this study

Figure 3. PCR screening for the reverse transcriptase domain of RtPc elements in phylogenetic
lineages of Paracoccidioides species. Electropherogram results from PCR analysis of a phylogenetically diverse
panel of Paracoccidioides species. The reverse transcriptase domain of the five RtPc elements was PCR amplified
from the genomic DNA of 31 isolates of the three cryptic species of P. brasiliensis and three isolates of P. lutzii. Panel A shows the PCR amplification products for the five RtPc elements from the genomic
DNA of the three isolates sequenced in the Broad Institute FGI (Pb01, Pb03 and Pb18).
Panel B shows the amplification products for the five RtPc elements from the genomic DNA of
31 Paracoccidioides isolates. The first block is composed of two P. lutzii isolates. The second block
is composed of four P. brasiliensis isolates, the first two being PS2 and the latter two PS3.
In the third block, the first 13 isolates are P. brasiliensis S1 and the latter four
are Paracoccidioides spp. The fourth block, at right, is composed of eight Paracoccidioides spp. Oligonucleotide primer pairs are given in Additional file 6.

Genomic organization and transcription of RtPc elements

We also analyzed the genomic organization of RtPc elements by Southern blot hybridization
using probes corresponding to the retrotransposons LTR-Gypsy-RtPc1 and LINE-Tad-RtPc5. Figure 4 shows the results obtained using genomic DNA from P. lutzii (Pb01) and P. brasiliensis (Pb18). As expected from the in silico analysis, the LTR-Gypsy-RtPc1 probe hybridized to multiple genomic fragments from P. lutzii (Pb01) and P. brasiliensis (Pb18), confirming the polymorphic nature and abundance of this element. The number
and signal intensity of hybridizing fragments were higher in Pb01 than in
Pb18, confirming the variation in the copy number of LTR-Gypsy-RtPc1. The hybridization patterns obtained with the LINE-Tad-RtPc5 probe indicate that this element is more abundant in the P. lutzii genome.

Figure 4. Genomic Southern blot analysis and chromosomal distribution of RtPc1 and RtPc5 elements.
Genomic DNA from P. lutzii (panels A and D) and the Pb18 P. brasiliensis isolate (panels B and E) was digested with restriction enzymes, blotted onto nylon membranes and hybridized
with RtPc1 (panels A and B) and RtPc5 (panels D and E) probes derived from the reverse transcriptase region of each element. The restriction
enzymes used were EcoRI (E), BamHI (B), HindIII (H), BglII (Bg), HinfI (Hf), HincII (Hc), AccI (Ac) and EcoRV (Ev). Chromosomal distribution
of RtPc1 (panel C) and RtPc5 (panel F) elements in P. lutzii and P. brasiliensis. The molecular karyotypes of P. lutzii and P. brasiliensis (B339 and Pb18) are shown on the left. Chromosomal bands were separated by PFGE and
stained with EtBr. The autoradiograms from Southern hybridizations using the RtPc1
and RtPc5 probes derived from the reverse transcriptase region are shown on the right.

LTR-Gypsy-RtPc1 and LINE-Tad-RtPc5 elements were mapped to the chromosomal bands of isolates Pb01 and Pb18 (Figure 4, panels C and F), which had been separated by PFGE. Pb01 and Pb18 showed distinct
karyotype profiles with four and five chromosomal bands, respectively. Isolate B339
was used as a reference for chromosomal band size. The LTR-Gypsy-RtPc1 probe hybridized to three and four chromosomal bands in isolates Pb01 and Pb18,
respectively (Figure 4, panel C). The LINE-Tad-RtPc5 probe hybridized to three and two chromosomal bands in isolates Pb18 and Pb01,
respectively (Figure 4, panel F).

To detect transcripts of RtPc elements, semi-quantitative RT-PCR was performed using
cDNAs from isolates Pb01 (P. lutzii), Pb03 and Pb18 (P. brasiliensis, PS2 and S1). Figure 5 shows that the five retrotransposons were transcribed in the yeast form of P. lutzii (Pb01), but only RtPc1, RtPc2, RtPc3 and RtPc4 were transcribed in the yeast form
of P. brasiliensis (Pb03 and Pb18).

Figure 5. RtPc transcript analysis. Total RNA from Pb01, Pb03 and Pb18 was used to detect mRNAs
corresponding to the reverse transcriptase region of RtPc elements. Reverse transcription
data was normalized to ?-tubulin.

Phylogenetic analysis of RtPc elements

The reverse transcriptase domains from the 24 complete elements were employed to establish
the phylogenetic relationship between RtPc elements derived from each of the two species
of the genus Paracoccidioides, P. brasiliensis (isolates Pb18 and Pb03) and P. lutzii (isolate Pb01). Intact copies of LTR-Gypsy-RtPc1 and LINE-Tad-RtPc5 have been identified in P. brasiliensis and P. lutzii. In the phylogenetic tree (Figure 6A), three major clusters can be distinguished. One group contains the LTR-Gypsy elements, the second contains LTR-Copia elements and the third contains the LINE elements, all strongly supported by high
posterior probabilities, thus confirming the classification of these elements. The
phylogenetic tree for the LTR-Gypsy-RtPc1 element (Figure 6B) showed 3 clusters, two of which were composed of species-specific sequences from
P. brasiliensis (Pb18) and P. lutzii (Pb01). The first branch consisted of sequences exclusively from P. brasiliensis (Pb18), comprising 6 of the 8 intact copies found in this species (supercontigs 1.4,
1.6, 1.7, 1.10, 1.11 and 1.14); the central branch was composed of two RtPc sequences
from P. brasiliensis (Pb18, supercontigs 1.1 and 1.3) and three from P. lutzii (Pb01, supercontigs 1.22, 1.19, 1.29 and 1.16), and a high similarity was observed
among the five elements inside this cluster (Figure 6B). The grouping of elements from these two different species illustrates the degree
of similarity between these elements and suggests a common ancestry. The third branch
harbored only one P. lutzii sequence (supercontig 1.16) (Figure 6B). This pattern suggests that the RtPc1 elements from P. brasiliensis and P. lutzii share sequences that were present in a relatively recent common ancestor, thus supporting
the hypothesis of pre-speciation emergence of RtPc1 insertions. The RtPc3-Copia elements
clustered in very close branches and showed a high sequence similarity, which could
be explained by the low number of copies analyzed and by the fact that intact copies
were only identified in P. brasiliensis lineages Pb03 and Pb18. However, for the LINE-Tad-RtPc5 element (Figure 6C), it was possible to observe a species-specific grouping, although only a single
copy of RtPc5 has been analyzed in P. brasiliensis (isolate Pb03). Thus, despite the LINE-Tad-RtPc5 elements from P. brasiliensis and P. lutzii being located in separate branches, they nevertheless are phylogenetically close
(Figure 6C).

Figure 6. Phylogenetic analysis of RtPc elements and related fungal retrotransposons. Phylogenetic
tree of all intact RtPc elements using nucleotide sequences of the reverse transcriptase
domain of P. brasiliensis (Pb03 and Pb18) and P. lutzii (Pb01) with Bayesian inference. Trees were rooted with the other subfamilies. LTR
elements from Aspergillus fumigatus (Gypsy-3-I_AF), Arthroderma otae (Gypsy-1-I_Ao), Ajellomyces capsulatus (AAJI01001759.1), Blumeria graminis (Copia-11_BG-I), Ajellomyces capsulatus (XM_001540167.1) and Blumeria graminis (Tad1-16-BG) were used as outgroups (see Methods). A shows the phylogenetic tree of all 24 RtPc intact elements identified in Pb01, Pb03
and Pb18. LTR-Gypsy-RtPc1 and LINE-Tad-RtPc5 trees are shown in more detail in B and C, respectively.

Based on the assumption that the two LTRs accumulate point mutations independently,
the insertion time of an LTR retrotransposon can be estimated from the sequence
differences between its two LTRs, which would have been identical
at the time of insertion [26]. Very low nucleotide divergence was observed in the LTRs of the 18 intact RtPc elements
(Additional file 5). Considering that the analyzed elements are intact and most likely active, the estimated
divergence between LTRs suggested recent insertion events in the genomes of P. brasiliensis and P. lutzii.
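
The dating principle described above can be made concrete with a short Python sketch. It assumes the two LTRs of one element are already aligned, measures their divergence K with the Kimura two-parameter model, and converts it to an insertion time with T = K / (2r); the factor of two reflects that both LTRs accumulate substitutions independently. The choice of distance model, the function names, the toy sequences and the substitution rate r are assumptions made for illustration; the text does not state which model or rate was used.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(ltr5, ltr3):
    """Kimura two-parameter distance between two aligned LTR sequences."""
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    if transitions == 0 and transversions == 0:
        return 0.0
    p = transitions / compared
    q = transversions / compared
    inner = (1 - 2 * p - q) * math.sqrt(1 - 2 * q) if 1 - 2 * q > 0 else 0.0
    if inner <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(inner)


def insertion_time(distance, rate):
    """Insertion time T = K / (2r), with r in substitutions per site per year."""
    return distance / (2 * rate)


if __name__ == "__main__":
    # Toy sequences and an arbitrary placeholder rate, not values from the study.
    k = k2p_distance("TGTTAGCATGCA", "TGTTAGCGTGCA")
    print(k, insertion_time(k, rate=1e-9))
```

Under this model, identical LTRs give K = 0 and therefore an insertion time of zero, which is why very low LTR divergence is read as evidence of recent insertion.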