Recombination of chl-fus gene (Plastid Origin) downstream of hop: a locus of chromosomal instability


Capture and validation of plant hop and chl-fus gene sequences

The first chl-fus gene was cloned and characterized in G. max [12]. From protein sequence alignments of its encoded open reading frame (ORF), as well
as chloroplast-type transit peptide analysis, it was suggested that the mature protein
belongs to the chloroplast protein synthesis machinery [12], [22]. For example, the Arabidopsis thaliana cEF-G (At_cEF-G) shares 44 % identity with its mitochondrial counterpart (At_mEF-G),
but 59 % with Escherichia coli EF-G (γ-Proteobacteria), 54 % with Synechococcus sp. EF-G (Cyanobacteria) and 62 % with Agrobacterium fabrum (α-Proteobacteria) EF-G. Many other chl-fus genes have been registered in GenBank, sometimes confounded with mEF-G (not shown).

Gene mapping efforts in G. max, following the discovery of the chl-fus gene, revealed that chl-fus is located downstream of the hop gene in the opposite orientation [8]. Microsynteny analyses of newly sequenced genomes can determine whether the
transcriptional convergence of the hop and chl-fus genes is ubiquitous, or whether G. max is an isolated case. We therefore used the G. max chl-fus gene as a BLAST query sequence to search for plant genomic contigs coding for a
predicted cEF-G preceded by a chloroplast-type transit peptide [12], together with a hop gene in convergent transcription. The families, genera and species, and the corresponding
accession numbers of the contigs retrieved from GenBank, are provided in Table 1. In plant species whose chl-fus and hop genes were not syntenic, the G. max hop gene alone [8] was used as the query to capture Hop-encoding sequences. Using the G. max chl-fus and hop genes as references, we mapped the predicted exon–intron structure of each gene for
all plant species. To validate the assembled ORFs, phylogenetic trees were constructed
in silico with the predicted cEF-G and Hop proteins.

Table 1. Accession numbers of retrieved contig sequences obtained from plant genome databases.
The numbers of introns of the hop and chl-fus genes, respectively, are given in Arabic numerals

We show in Fig. 1 a well-supported phylogenetic tree constructed with EF-G sequences from Actinobacteria,
α-Proteobacteria and Cyanobacteria and 53 cEF-G sequences from Chlorophyta, Gymnosperms,
Monocots and Dicots. The branching pattern of the cladogram indicates that EF-Gs from
all life forms descended from a common ancestor. According to the evolutionary relationships,
plant cEF-G sequences group together in a single branch with G. max cEF-G (our reference sequence), confirming that the assembled plant ORFs all belong
to the chloroplast EF-G family. Chlorophyta cEF-G sequences share a common ancestor
with higher plants, except for Chlamydomonas reinhardtii, which appears to form a clade apart from the other green algae. The two gymnosperms
are part of the major clade with vascular plants, although in separate lineages. The monocot
and dicot branches are consistent with canonical evolutionary trees; however, the dicot
branch had low support (bootstrap values below 50 %), leaving this clade
unresolved [23]. As already reported [10], cEF-G sequences show more identity with α-proteobacterial EF-G than with cyanobacterial EF-G,
and this finding is confirmed in Fig. 1, without exception. Taking these results together, we concluded that the cEF-G
sequences retrieved from GenBank were correctly reconstructed and that they code for the chloroplast
translation elongation factor G.

Fig. 1. Phylogenetic tree of chloroplast elongation factor cEF-G sequences from 53 plant genomes.
Bootstrap values are in Arabic numerals. Dicot branch was collapsed (bootstrap values
less than 50 %). Other members of the EF-G family: At_mEF-G: A. thaliana mitochondrial elongation factor G (outgroup). α-Proteobacterial EF-G: R. prowazekii, A. caulinodans and A. fabrum. Actinobacterial EF-G: K. radiotolerans, F. alni and S. coelicolor. Cyanobacterial EF-G: Synechococcus. 0.08: Distance scale

After intron removal from the hop genes, the reconstructed Hop sequences were used to build a second phylogenetic tree
(Fig. 2). As expected, the assembled ORFs all belong to the plant Hop family, which exhibits
a large amount of divergence with respect to the outgroup (human Hop). As seen in
Fig. 2, the inferred relationships among these protein sequences are robust and all branches
are well supported, consistent with current plant systematics.

Fig. 2. Phylogenetic tree of Hop protein sequences of 53 plant genomes. Chlorophyta, gymnosperm,
monocot and dicot orthologous proteins were included. Hs_Hop: Human Hop protein (outgroup).
0.07: Distance scale

Interestingly, Leavenworthia alabamica is grouped with the other members of Brassicales but at an unusually long evolutionary
distance (Fig. 2). Exceptionally, L. alabamica contains three tandem repeats of the VPEVEKKLEPEPEP motif within the Ch. AA domain,
while all other plants possess only one. These results confirm the correct assembly
of hop genes from the retrieved contigs.

Preserved microsynteny and microcolinearity between hop and chl-fus genes

The hop and chl-fus genes were discovered in G. max one after the other on the same chromosome, in a convergent transcription arrangement
[8]. This finding leads to two intriguing evolutionary questions: Have the hop and chl-fus genes been together from the first to the present-day photosynthetic eukaryotes?
Or is their chromosomal contiguity specific to G. max? The microsyntenic arrangement of the hop and chl-fus genes was determined for all 21 plant families under study (Fig. 3, and species-specific details in Additional file 1: Figure S1). In Chlorophyta, two families were mapped (Mamiellaceae and Chlamydomonadaceae)
and each gene was found on a separate chromosome, suggesting the absence of microsynteny
in this plant division. This was also the case for gymnosperms (Funariaceae and Pinaceae).
In contrast, 2 out of the 3 monocot families studied carry the hop and chl-fus genes on the same chromosome. Only in Ensete ventricosum (Musaceae) was the pair of genes found on separate chromosomes. In the same manner,
microsynteny is preserved in most dicots, except in the Cucurbitaceae (3 species)
and Fabaceae (3 out of 5 species) families, where the pair of genes is located on
different chromosomes (Additional file 2: Table S1). In summary, the microsynteny of hop and chl-fus prevails in 75 % (40 out of 53) of the green plants studied. A graphical summary of the microsynteny
between the hop and chl-fus genes among all plant species under study is shown in Additional file 3: Figure S2.

Fig. 3. Microsyntenic arrangement (to scale) of the pair of genes hop and chl-fus, among the 53 plant genomes under study. Hop protein TPR and DP domains are color-coded
according to conventions (bottom boxes). IGR: intergenic region. IGRs containing numbers,
e.g., 10000 bp, are not to scale. Non-syntenic genes are drawn on separate chromosomes

Concerning the one-to-one microcolinearity in convergent transcription of hop and chl-fus, three types of genome arrangements (I to III) were found in plants (Fig. 4), as follows: I) each gene resides on a different chromosome, i.e., they are not collinear (all
Chlorophyta, gymnosperms, one monocot, and six dicots); II) in Malvaceae (Gossypium raimondii and Theobroma cacao) the chl-fus gene moved just upstream of hop and both genes are transcribed in the same direction, i.e., a local chromosome inversion
[24], [25]; and III) hop and chl-fus are collinear in convergent transcription (no inserted elements), which is the most
frequent arrangement in both monocots and dicots (38 out of 53 species analyzed, or about 72 %).
Interestingly, Elaeis guineensis and Phoenix dactylifera (monocots), as well as Morus notabilis and Linum usitatissimum (dicots), harbor sequences coding for retrovirus-like proteins within their intergenic
sequences, i.e., inserted between the hop and chl-fus genes (see the section about molecular instability of the intergenic region). Detailed
physical maps for each species under study are shown in Additional file 1: Figure S1.
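
As an illustration only, the three microcolinearity categories can be written as a small decision rule. The Python sketch below is not the procedure used in this study: the `Gene` record, its field names and the coordinates in the usage note are hypothetical, and type II is simplified to "same chromosome, same strand" without checking which gene lies upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Minimal gene record; field names are illustrative assumptions."""
    chrom: str   # chromosome or scaffold identifier
    start: int   # leftmost coordinate (1-based)
    end: int     # rightmost coordinate (1-based, inclusive)
    strand: str  # "+" or "-"


def classify_colinearity(hop: Gene, chl_fus: Gene) -> str:
    """Return "I", "II" or "III" following the categories in the text."""
    if hop.chrom != chl_fus.chrom:
        return "I"    # genes on different chromosomes: not collinear
    if hop.strand == chl_fus.strand:
        return "II"   # same direction (simplified, see lead-in)
    # Opposite strands: convergent if the left gene is on "+" and the
    # right gene on "-", so that their 3' ends face each other.
    left, right = sorted((hop, chl_fus), key=lambda g: g.start)
    if left.strand == "+" and right.strand == "-":
        return "III"
    return "unclassified"  # divergent pair, not described in the text
```

For example, with invented coordinates, `classify_colinearity(Gene("chr1", 100, 900, "+"), Gene("chr1", 1200, 3000, "-"))` falls in category III, while the same two genes placed on different chromosomes fall in category I.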

Fig. 4. Grouping of gene arrangements found for the pair of genes hop and chl-fus, among the 53 plant genomes under study. CO: classification by microcolinearity (categories
I to III); GA: classification by gene arrangement, according to the exon–intron structure
of both hop and chl-fus combined (categories A to J). Arabic numerals in parentheses: number of species sharing the
same gene arrangement; hop and chl-fus genes are represented by arrows to summarize gene topology. Ex-Intr hop: exon–intron organizations found for the hop gene (categories h1 to h6); Ex-Intr chl-fus: exon–intron organizations found for the chl-fus gene (categories f1 to f5). Arabic and Roman numerals represent intron phase (0, 1, or 2) and succession of
introns from I to I + n, respectively; hop introns are named I_h, II_h, III_h, etc., and chl-fus introns are named I_f, II_f, III_f, etc. Exons coding for TPR and DP domains are color-coded according to conventions
(bottom boxes). IGR: intergenic region. Non-syntenic genes are drawn on separate chromosomes

Parallel evolution of exon–intron gene structure of hop and chl-fus genes

The human hop gene contains 13 introns, and intron phase was essential to hypothesize the evolutionary
origin of Hop domains by exon shuffling [6]. However, the intron number and phase of plant hop genes are still unknown, and these data could reinforce the role of introns in hop evolution from the initial stages of eukaryotic development. Therefore, we examined
the exon–intron organization of the hop and chl-fus genes among the 53 plant species, to infer the contribution of introns to the evolution
of their resultant proteins (Table 1, Figs. 3 and 4, and Additional file 1: Figure S1).

The combined arrangement of exons and introns in the coding sequences
of the hop–chl-fus pair in plants falls into one of ten categories (A to J), as shown in Fig. 4. In type A (O. lucimarinus and O. tauri), hop lacks introns, while chl-fus holds a single intron (labelled I_f) separating the transit peptide-coding
exon from the sequence coding for the mature protein. At first sight, the Micromonas sp. hop gene does not contain introns; however, it is very likely that a 5′ intron is located
after the first 18 nucleotides. An exceptionally long predicted Hop protein is reported
in GenBank under the accession number XP_002500383; this polypeptide shares high identity
with other plant Hop proteins, but contains 71 extra amino acids not found in any
other eukaryote. A fine-scale analysis of this insertion suggests that an intron may
have gone unnoticed so far because it is in frame with a short 5′ exon coding for
the conserved amino acids MADEHK. We show in Additional file 4: Figure S3 (A) an HCA alignment of the predicted Micromonas sp. [GenBank: XP_002500383] and A. thaliana Hop proteins. In this alignment, a perfect match is evident between the two proteins,
excluding the 71 extra N-terminal amino acids of Micromonas sp. (bordered by a rounded rectangle). In Additional file 4: Figure S3 (B), we represent the translated 5′ regions of the Micromonas sp. and predicted C. reinhardtii hop genes. We propose that the nucleotides in bold belong to a phase-0 intron (I_h), which is in frame with the first and second exons. Moreover, the exon–intron
boundaries conserve the canonical splice consensus sequences AG:GT and CAG:GC [26], [27]. According to this hypothesis, the predicted ORFs encode Hop proteins with the same
number of amino acids as the other plant Hop members (Additional file 4: Figure S3 (C)). In addition, no significant similarity was found in a BLAST search
using the 71 extra amino acids as query (not shown). Taken together, these results
led us to the conclusion that the Micromonas sp. hop gene must contain one intron located just after the first six codons (amino acids
MADEHK). Thus, Micromonas sp. is classed in type B (Fig. 4), in which both non-collinear genes have a single intron, i.e., 1–1 (Table 1).
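
The argument for an unnoticed intron rests on two properties that are simple to test: a phase-0 intron can be read through by gene prediction only if its length is a multiple of three and it carries no in-frame stop codon, and a genuine intron is expected to begin with GT and end with AG. The following Python sketch illustrates both checks. It is not the analysis performed here; the function names are hypothetical and the sequences are invented toy data, not Micromonas sp. sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_in_frame_intron(intron: str) -> bool:
    """True if a phase-0 intron could be translated through unnoticed.

    Requires a length divisible by 3 and no stop codon when the intron
    is read in the frame of the flanking exons.
    """
    intron = intron.upper()
    if len(intron) % 3 != 0:
        return False
    codons = (intron[i:i + 3] for i in range(0, len(intron), 3))
    return not any(codon in STOP_CODONS for codon in codons)


def has_gt_ag_boundaries(intron: str) -> bool:
    """True if the intron starts with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy data (assumption): one possible 18-nt encoding of MADEHK, which
# ends in AG, followed by an invented 12-nt intron.
toy_exon1 = "ATGGCTGATGAGCACAAG"
toy_intron = "GTGCGCTTCCAG"
```

With these toy sequences both checks succeed for `toy_intron`, whereas an invented intron such as `GTATGACAG` has GT and AG boundaries but fails the in-frame check because it contains the stop codon TGA.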

In type C (C. reinhardtii), hop contains 12 introns while chl-fus has 9. Unlike the other members of the division Chlorophyta, C. reinhardtii has accumulated a large number of introns; some of them lie in positions shared
with human and higher plants (see next section). In type D (Physcomitrella patens, a gymnosperm), each gene is located on a separate chromosome; hop comprises 7 introns and chl-fus 6. Picea abies, another gymnosperm, belongs to type E, where hop has the same intron number as in type D but the intron number is reduced to 3 in the chl-fus gene. In type F (Musaceae (monocot), Cucurbitaceae and 3 out of 5 Fabaceae (dicots)),
hop and chl-fus are not syntenic, but the individual genes hold the same 6–3 structure as the largest
group of convergently transcribed genes in higher plants (type I). In type G, the
exon–intron structure is the same as in type I (6–3), but chl-fus was transposed to the 5′ flanking site of hop and is transcribed in the same direction (Fig. 4). In types H (5–3) and J (6–2), hop and chl-fus, respectively, lack one intron with regard to type I. We conclude that, during
evolution, the hop and chl-fus genes underwent extensive changes in their exon–intron structure, among unicellular
photosynthetic eukaryotes as well as in higher plants. It is interesting to note
that intron gain/loss affected both genes alike within a species. For example, C. reinhardtii (type C) hop and chl-fus both carry many introns (simultaneous intron gain?), while in
O. lucimarinus (type A) the two genes together retain only one (simultaneous intron loss?). This finding also applies
to higher plants (Fig. 4).

Intron position and phase as determinant of exon shuffling

In previous publications, it has been proposed that domain/module duplication has
contributed to gene evolution through exon shuffling [28]. Bioinformatic analyses of vertebrate Hop orthologs suggested that the TPR and DP domains
behaved as a single recombination unit due to the presence of phase-0 introns [6]. Phase-0 introns, which lie between two codons, are the most favorable for exon duplication or shuffling without
modifying the reading frame [28], and the human hop gene comprises TPR–DP modules surrounded by phase-0 introns. Likewise, on the basis of sequence
alignments, it was hypothesized that EF-G emerged as a result of gene duplication/fusion
events [29].
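
To make the notion of intron phase concrete, the sketch below derives the phase of each intron from the coding lengths of the exons, and lists the internal exons flanked by introns of equal phase, which are the ones that can be duplicated or shuffled without shifting the reading frame. This is a minimal Python illustration under stated assumptions: the function names and the exon lengths are invented and do not correspond to any gene analyzed here.

```python
def intron_phases(cds_exon_lengths):
    """Phase of each intron: coding bases preceding it, modulo 3.

    cds_exon_lengths holds the coding length of each exon, 5' to 3'.
    Phase 0 means the intron lies between two codons.
    """
    phases = []
    coding_so_far = 0
    for length in cds_exon_lengths[:-1]:  # an intron follows every exon but the last
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


def symmetric_exons(cds_exon_lengths):
    """Indices (0-based) of internal exons flanked by introns of equal phase."""
    phases = intron_phases(cds_exon_lengths)
    return [i for i in range(1, len(cds_exon_lengths) - 1)
            if phases[i - 1] == phases[i]]


# Invented exon lengths (assumption, for illustration only).
toy_gene = [18, 300, 121, 200]
```

Here `intron_phases(toy_gene)` gives `[0, 0, 1]`, and only the second exon (index 1), bounded by two phase-0 introns, is symmetric. Duplicating that exon, `[18, 300, 300, 121, 200]`, yields phases `[0, 0, 0, 1]`, so the phase of the downstream intron, and hence the reading frame, is unchanged.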

We analyzed the exon–intron topologies and intron phase distribution within plant
hop and chl-fus genes, in order to reconstruct the molecular events leading to the emergence of the present-day
genes. As shown in Fig. 4, hop genes can be grouped into 6 classes of exon–intron structure (h1–h6), while chl-fus genes are grouped into 5 classes (f1–f5). Considering only the hop gene, it contains zero, one or more introns in green algae. No introns were found
in either Ostreococcus lucimarinus or O. tauri (Class h1), while Micromonas sp. was predicted to contain one 5′ phase-0 intron (Class h2). In contrast to the above-mentioned members of the Mamiellaceae family (Fig. 4), C. reinhardtii (Chlamydomonadaceae) is the photosynthetic eukaryote with the greatest number of
introns, with 12 short introns evenly distributed within the coding region
(Class h3). Although most of its introns are phase-0 (9 out of 12), the recombinable module that
most resembles those found in vertebrates is located between the phase-0 introns I_h and VI_h. This unit contains a complete TPR-DP-Ch. AA module, able to recombine by exon shuffling.
The two gymnosperms, P. patens and P. abies, belong to Class h4, with 7 introns located in equivalent positions. Class h5 is the most abundant gene structure in higher plants (46 species). The first intron
(I_h, phase-0) splits the TPR2A domain. The remaining introns (3 out of 5 are phase-0) split
the end of the TPR2A-coding exons and the C-terminal TPR2B–DP2-coding sequences.
Finally, Class h6 (Aethionema arabicum, one member out of 9 of the Brassicaceae family) exhibits the same exon–intron topology
as Class h5, except that it lacks the Class h5 intron V_h, located within the DP2 domain (Fig. 4).

Disparities in intron number among hop orthologs were used to define classes h1 to h6 (Fig. 4). Additional file 5: Figure S4 shows that not all intron positions are conserved among higher plants.
For example, the first intron (phase-0) of the C. reinhardtii hop gene (I_h), located between amino acids K and A (red line), is also found in Micromonas sp. but not in O. lucimarinus, L. alabamica or A. arabicum. The second intron (phase-0) of C. reinhardtii (II_h) is located between Y and A (blue line) and is exclusive to this species, and so forth.
From Additional file 5: Figure S4 it is inferred that intron positions are mainly conserved among hop genes from higher plants, but only partially between higher plants and Chlorophyta
or between plants and human. For instance, C. reinhardtii introns II_h (0), III_h (0), IV_h (1), V_h (0), VI_h (0), VIII_h (2), IX_h (0) and XI_h (2) (blue lines) are exclusive to this green alga, while introns VII_h (0) and XII_h (0) (red lines) are shared with L. alabamica, A. arabicum and the rest of higher plants. Finally, higher plants contain introns restricted
to monocots and dicots, i.e., introns II_h (1), III_h (0) and IV_h (2) (red lines). Exceptionally, A. arabicum (Brassicaceae, Class h6) lacks the phase-0 intron V_h of the other higher plants (Class h5). At the bottom of Additional file 5: Figure S4 we represent the human Hop protein and its related introns. A careful
comparison of intron locations between plants and human reveals that human Hop shares
two introns with C. reinhardtii (i.e., I_h (0) and X_h (0), red lines), but not with higher plants.

On the other hand, the chl-fus gene has undergone a greater reduction in intron number than hop. Its exon–intron structure was organized into five classes (f1 to f5), according to the number and position of introns (Fig. 4). From algae to higher plants, the chl-fus gene contains a phase-1 intron that separates the transit peptide from the mature
protein; this implies that a new exon coding for an N-terminal transit peptide was
recruited for the correct trafficking of cEF-G from the cytoplasm to the plastids [30]. More precisely, Class f1 comprises all predicted Mamiellaceae chl-fus genes, with a single phase-1 intron inserted between the chloroplast-targeting domain
and the rest of the coding sequence (Fig. 4). In contrast, the C. reinhardtii (Chlamydomonadaceae) chl-fus gene has eight additional phase-0 introns interspersed within the cEF-G coding region
(Class f2). Class f3 is a single form of chl-fus with five introns located in different places with respect to the rest of the plant chl-fus genes. Class f4 is the most prevalent exon–intron organization found in monocot and dicot plants
(47 species). Apart from the phase-1 intron following the transit peptide-coding exon, it contains two phase-0 introns, II_f and III_f, located within the 3′
half of the chl-fus gene (Fig. 4). Finally, only one member of Brassicaceae out of 9 (Brassica rapa) belongs to Class f5, which contains three exons and two introns. The B. rapa chl-fus gene lacks intron II_f with respect to Class f4.

Molecular instability of the hop and chl-fus intergenic region

In several plant families, the intergenic region (IGR) between the hop and chl-fus genes underwent insertions and deletions. While 82 % of monocots and dicots preserve
microcolinearity, the length of the IGR varies among species. For example, the shortest
IGR belongs to Leavenworthia alabamica (188 bp), while the longest belongs to Linum usitatissimum (38523 bp). Nevertheless, the IGR typically does not exceed 3500 bp (Additional
file 1: Figure S1). IGR nucleotide sequences were analyzed by tBLASTn in order to identify
potential ORFs. Plant retroviruses (or retrotransposons) and hypothetical genes were
found in monocots (Elaeis guineensis and Phoenix dactylifera) and dicots (Morus notabilis and Linum usitatissimum), within IGRs of at least 10 kb. For example, a putative pararetrovirus-like pseudogene was found
within the 10 kb IGR of M. notabilis. In Additional file 6: Figure S5 (A), we show a ClustalW alignment between a putative polyprotein encoded
by the M. notabilis IGR and a Citrus endogenous pararetrovirus retrieved by BLAST (45 % identity). The predicted M. notabilis polyprotein is interrupted by 12 aberrant stop codons, suggesting that it
could be a pararetrovirus pseudogene. Furthermore, transposon-like repeated sequences
were found in a number of species. For example, inverted repeat sequences of Miniature
Inverted–Repeat Transposable Elements (MITEs) [31] were found within the IGR of Oryza spp. (Additional file 6: Figure S5 (B)) and direct repeats of CACTA-like transposons [32] reside in the M. truncatula IGR (not shown).

Two interesting cases of deletions within the IGR have been found in higher plants,
which alter the 3′ untranslated region of the hop and chl-fus genes. In Glycine max, a plant with a predicted allopolyploidization event [33], two chl-fus genes were cloned and sequenced from cv. Ceresia (98 % identity between the cEF-G1 and
cEF-G2 proteins), both with hop genes in convergent transcription [8]. ClustalW alignments were performed between the chl-fus genes of G. max cv. Ceresia and cDNAs from G. max cv. Williams, which contain three different poly-A sites (Additional file 7: Figure S6 (A)). An almost perfect match was found between the cDNAs and the chl-fus1 and chl-fus2 genes over the coding part and the
3′ untranslated region; however, chl-fus1 abruptly loses identity 123 nucleotides downstream of the stop codon. A detailed
nucleotide analysis allowed us to conclude that a chromosomal deletion (ca. 680 bp) maps
between the chl-fus1 and hop1 genes (Additional file 7: Figure S6 (B)).

A more severe case of IGR deletion is found in A. thaliana, in which the 3′ transcribed regions of the hop and chl-fus genes overlap. We show in Additional file 8: Figure S7 a chromosomal map of the A. thaliana hop and chl-fus genes, and three cDNAs of each gene, with multiple poly-A sites. As can be observed,
the 3′ ends of three hop cDNAs and of two chl-fus cDNAs overlap. Thus, in the strict sense, the IGR between the hop and chl-fus genes is missing; nevertheless, according to the GenBank cDNA accessions, both genes
are transcribed. We conclude that the IGR separating the hop and chl-fus genes in plants seems to be a target region for insertion and deletion (indel) events,
making it genetically unstable.
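
For convergently transcribed genes, the IGR is the stretch between the two facing 3′ ends, and overlapping 3′ transcribed regions correspond to a negative IGR length. The Python sketch below shows this bookkeeping under simple assumptions: coordinates are 1-based and inclusive, both features lie on the same chromosome, and the function names and example coordinates are hypothetical rather than taken from the genomes analyzed.

```python
def igr_length(upstream_end: int, downstream_start: int) -> int:
    """Number of bases strictly between two features (1-based, inclusive).

    A negative value means that the two features overlap.
    """
    return downstream_start - upstream_end - 1


def describe_igr(upstream_end: int, downstream_start: int) -> str:
    """Report either the IGR length or the extent of the overlap."""
    gap = igr_length(upstream_end, downstream_start)
    if gap < 0:
        return f"overlap of {-gap} bp"
    return f"IGR of {gap} bp"
```

With invented coordinates, a hop 3′ end at position 5000 and a chl-fus 3′ end at position 5189 give `describe_igr(5000, 5189) == "IGR of 188 bp"`, whereas a chl-fus transcript reaching back to position 4950 gives `"overlap of 51 bp"`, the situation described for A. thaliana in which no IGR remains in the strict sense.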