Asymmetric somatic hybridization induces point mutations and indels in wheat

cDNA sequencing

A total of 19,045 SR3 and 10,327 JN177 clones were sequenced, yielding
18,192 and 9770 usable sequences, respectively (Additional file 1: Table S1). The sequences resolved into 9634 unigenes (2097 contigs and 7537 singletons)
from SR3, and 7107 unigenes (1207 contigs and 5900 singletons) from JN177, of which
4825 and 2975, respectively, were full-length cDNAs (Additional file 1: Table S1). Most of the unigenes were 700–1000 nt in length (Additional
file 2: Figure S1), and their mean GC content was 53.85 % (SR3) and 55.46 % (JN177). The
BLASTn analysis of the unigenes revealed that 2581 were shared (96 % identity) between
SR3 and JN177 (Table 1). In all, 5072 (71.4 %) of the JN177 and 7284 (75.6 %) of the SR3 unigenes shared
96 % identity with sequences represented in the wheat EST database (Table 1).
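The counts above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis; the dictionary layout and the function name `percent_with_est_hit` are illustrative assumptions, while the numbers are those reported in the text.

```python
# Counts as reported in the text (Additional file 1: Table S1; Table 1).
# The data structure and names below are illustrative assumptions.
libraries = {
    "SR3": {"contigs": 2097, "singletons": 7537, "unigenes": 9634, "est_hits": 7284},
    "JN177": {"contigs": 1207, "singletons": 5900, "unigenes": 7107, "est_hits": 5072},
}


def percent_with_est_hit(lib):
    """Percentage of unigenes matching a wheat EST at the identity cut-off."""
    return 100.0 * lib["est_hits"] / lib["unigenes"]


for name, lib in libraries.items():
    # Contigs plus singletons should add up to the reported unigene total.
    assert lib["contigs"] + lib["singletons"] == lib["unigenes"]
    print(name, round(percent_with_est_hit(lib), 1))
```

Under these inputs the percentages round to the 75.6 % (SR3) and 71.4 % (JN177) quoted above.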

Table 1. BLASTn-based homology comparisons of unigene sequences

Frequency of single nucleotide polymorphisms (SNPs) in the unigene sequences

Based on the unigene sequences sharing 96 % identity, 15,226 SNPs were identified
within the unigene sequences shared between SR3 and JN177, equivalent to a SNP frequency
of 11.33 per 1000 nt of coding sequence (Table 2). The transition and transversion frequencies were, respectively, 6.70 and 4.63 per
1000 nt. The SNP frequency between JN177 and the sequences represented in the wheat
EST database (JN177 vs Ta comparison) was only about one half of this level (5.77 per 1000 nt) (Table 2), demonstrating that the somatic hybridization process was effective in inducing
point mutations. A comparison based on the sequences of the unigenes shared between
the BA progenitor tetraploid (T. turgidum) and the A genome carrier T. monococcum revealed a SNP frequency of 15.48 per 1000 nt, while that between T. turgidum and Ae. speltoides (related to the B genome progenitor) was 18.51 per 1000 nt, indicating that a high frequency
of mutation was induced during the formation of allotetraploid wheat. Similarly, the
estimated SNP frequencies between bread wheat and T. monococcum, Ae. speltoides, T. turgidum and Ae. tauschii (D genome progenitor) were, respectively, 12.02, 16.24, 12.13 and 5.40 per 1000 nt
(Table 3). Thus the mutation frequency induced by the somatic hybridization process appeared
to be similar in extent to that induced by allopolyploidization. The frequency of
SNPs between the unigene sequences of bread wheat and those of either T. monococcum or Ae. speltoides was lower than that between T. turgidum and either T. monococcum or Ae. speltoides unigenes (Table 3). This was consistent with the finding that the SNP frequency between SR3 and the wheat database
ESTs (SR3 vs Ta alignment) was lower than that of the SR3 vs JN177 alignment (Table 2). The SNP frequencies between the SR3 unigene sequences and those of the A, B, BA and D
genome species were similar to those between the JN177 unigenes and those of the A, B,
BA and D genome species (Table 3).
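The per-1000-nt frequencies quoted in this section are simple ratios of event counts to aligned length. The sketch below illustrates the calculation for the SR3 vs JN177 comparison. The function names are hypothetical, and the aligned length is not reported in the text; it is back-calculated here from the reported count and frequency purely for illustration.

```python
def per_kb(count, aligned_nt):
    """Events per 1000 nt of aligned sequence."""
    if aligned_nt <= 0:
        raise ValueError("aligned_nt must be positive")
    return 1000.0 * count / aligned_nt


def implied_aligned_nt(count, freq_per_kb):
    """Aligned length implied by a count and a per-1000-nt frequency."""
    return 1000.0 * count / freq_per_kb


# Values reported in the text for the SR3 vs JN177 comparison (Table 2).
snp_count = 15226
snp_per_kb = 11.33
transitions_per_kb = 6.70
transversions_per_kb = 4.63

# Assumption: not reported in the text, back-calculated (roughly 1.34 Mb).
aligned_nt = implied_aligned_nt(snp_count, snp_per_kb)

# Transitions and transversions should sum to the overall SNP frequency.
assert abs(transitions_per_kb + transversions_per_kb - snp_per_kb) < 0.005

# Fold difference relative to the JN177 vs Ta comparison (5.77 per 1000 nt).
fold_vs_ta = snp_per_kb / 5.77
print(round(aligned_nt), round(fold_vs_ta, 2))
```

The fold difference of just under two corresponds to the statement that the JN177 vs Ta frequency was about one half of the SR3 vs JN177 frequency.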

Table 2. The SNP frequencies in SR3 and JN177

Table 3. SNP and indel frequencies among wheat and its ancestors

The size distribution of indels in the unigene sequences

The indels ranged from 1 nt to 574 nt, and a majority of the indels involved only
1 nt. A significant number of indels was revealed by aligning the matched unigene
sequences, with the frequency of larger indels (23 nt) being clearly less than that
of the small ones (1–10 nt) (Table 4). In the SR3 vs JN177 comparison, 82.14 % of the unigenes possessed small indels,
a higher proportion than in the SR3 vs Ta and JN177 vs Ta comparisons. In contrast, 6.70 % of the unigenes had large indels in the SR3 vs
JN177 comparison, a lower proportion than in the other two comparisons. More unigenes carried
small insertions than small deletions, and the difference was more pronounced
in the SR3 vs Ta and JN177 vs Ta comparisons. In the SR3 vs JN177 comparison, the number of unigenes with large insertions was similar to the number with large deletions,
but unigenes with large deletions were more abundant than those
with large insertions in the other two comparisons. The comparison between the JN177
(and similarly SR3) unigene sequences with those represented in the wheat EST database
showed that for small indels, the ratio of insertion to deletion frequency was negatively
correlated with indel length (R² = 0.62 and 0.72, respectively) (Fig. 1a); the ratio was >1 for indels shorter than 6 nt, and <1 for indels longer than 6 nt.
However, for the larger indels, the insertion to deletion ratio was positively correlated
with indel length (R² = 0.59 and 0.65, respectively) (Fig. 1b); for indels ranging in length from 20 to 70 nt, the ratio was just 0.01–0.06, rising
to 0.28–0.86 for indels of length 71–200 nt, and to ~1.5 for indels longer than 200 nt
(Fig. 1b). The SR3 vs JN177 comparison revealed an insertion to deletion ratio of ~1 irrespective of indel
length (Fig. 1a, b).
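The relationship between indel length and the insertion/deletion ratio described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the per-length counts are invented placeholders rather than data from the study, and the R² value is taken here as the square of the Pearson coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def insertion_deletion_ratios(ins_counts, del_counts):
    """Insertions divided by deletions for each indel length.

    Lengths with no deletions are skipped to avoid division by zero.
    """
    return {
        length: ins_counts.get(length, 0) / n_del
        for length, n_del in sorted(del_counts.items())
        if n_del > 0
    }


# Hypothetical counts per indel length (nt); NOT data from the study.
example_insertions = {1: 300, 2: 120, 3: 90, 6: 20, 8: 6, 10: 3}
example_deletions = {1: 150, 2: 80, 3: 70, 6: 20, 8: 10, 10: 6}

ratios = insertion_deletion_ratios(example_insertions, example_deletions)
r = pearson_r(list(ratios.keys()), list(ratios.values()))
print(ratios, r, r ** 2)
```

With the placeholder counts the ratio falls as indel length increases, so `r` is negative, which is the direction of the trend reported for the small indels.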

Table 4. Indel variation in SR3 and JN177 unigene sequences

Fig. 1. The distribution of indel lengths. a: small indels. b: large indels. The insertion/deletion ratio was obtained by dividing the number of
insertions by the number of deletions. SR3-JN177: JN177 unigene sequences queried
with those of SR3. SR3-Ta: SR3 unigene sequences queried with wheat ESTs housed in
GenBank. JN177-Ta: JN177 unigene sequences queried with wheat ESTs housed in GenBank.
The correlation between indel size and insertion/deletion ratio was assessed using
Pearson correlation analysis

The frequency of indels in the unigene sequences

Small indels were used to calculate the indel frequency because they were markedly
more abundant than large indels (indel numbers not shown). In all, the 2581 matched
unigenes derived from the SR3 vs JN177 comparison revealed 2120 indels (1.58 per 1000 nt). Based on the JN177 sequences,
these comprised 1331 insertions and 789 deletions in SR3, equivalent to frequencies
of, respectively, 0.99 and 0.59 per 1000 nt (Table 5). In the comparison with the sequences represented in the wheat EST database, the
indel frequency in SR3 was 1.36 per 1000 nt. The similar comparison between JN177
and the sequences represented in the wheat EST database revealed an indel frequency
of only 0.93 per 1000 nt, implying that the asymmetric somatic hybridization process
was effective in inducing indels in coding sequence.

Table 5. The frequency of small indels in SR3 and JN177 unigene sequences

To compare the induction rates of indels caused by somatic hybridization and allopolyploidization,
equivalent calculations were made using the matched unigene sequences present in bread
wheat and its relatives/progenitor species (Table 3). Comparing the unigenes of T. turgidum with those of T. monococcum and Ae. speltoides revealed an indel frequency of, respectively, 1.90 and 1.28 per 1000 nt; the equivalents
between bread wheat and its various related species ranged from 1.42 to 2.31 per 1000 nt,
with the comparison involving Ae. tauschii producing the highest estimate. The indel frequencies estimated using the unigenes
of JN177 and SR3 also lay in the range 1 to 2 per 1000 nt, with the exception of the
comparison with Ae. tauschii, where the frequencies were, respectively, 4.42 and 5.37 per 1000 nt (Table 3).

In the SR3 vs JN177 comparison, the insertion frequency (0.99 per 1000 nt) was 1.69-fold the
deletion frequency (0.59 per 1000 nt) (Table 5). Based on the sequences represented in the wheat EST database, the insertion frequencies
were 3.41- and 2.58-fold higher than the deletion frequencies in SR3 and JN177, respectively.
Notably, the insertion frequency in the SR3 vs Ta comparison (1.05 per 1000 nt) was similar to that in the SR3 vs JN177 comparison (0.99 per 1000 nt), but its overall indel frequency was lower.
Thus, the preference for small insertions was weaker in wheat asymmetric somatic
hybrids than in allopolyploid wheat.
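The insertion-to-deletion fold value for the SR3 vs JN177 comparison can be recomputed from the reported counts. The sketch below uses only numbers given in the text; the variable names are illustrative.

```python
# Small indels in the SR3 vs JN177 comparison, as reported in the text (Table 5).
insertions = 1331
deletions = 789
total_indels = insertions + deletions

# Fold value from the raw counts and from the rounded per-1000-nt frequencies.
fold_from_counts = insertions / deletions
fold_from_rates = 0.99 / 0.59

# The two per-1000-nt frequencies should sum to the overall indel frequency (1.58).
assert abs((0.99 + 0.59) - 1.58) < 0.005

print(total_indels, round(fold_from_counts, 2), round(fold_from_rates, 2))
```

The reported 1.69-fold value matches the ratio of the raw counts; the ratio of the rounded frequencies comes out marginally lower because of rounding.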

The function of unigenes showing sequence polymorphism

To determine whether the genetic variation was associated with particular biological processes,
we selected sequences participating in gene expression regulation and other processes
for analysis. A selection of polymorphic unigenes represented in the JN177 and SR3
libraries (948 and 1519, respectively) as well as in the wheat EST database showed
that for gene expression regulation, the frequency of SNPs differed most notably in
genes involved in nucleosome assembly and chromatin assembly/disassembly (Table 6). The least polymorphic comparison was between JN177 and the wheat EST database unigenes
(4.59 SNPs per 1000 nt for the genes involved in the former category and 3.72 per
1000 nt in the latter). The same comparison between SR3 and JN177 produced, respectively,
18.17 and 19.21 SNPs per 1000 nt. The next most variable genes were those encoding
products involved in translation and post-translational modification, where the SR3
vs JN177 comparison revealed a SNP frequency of, respectively, 13.30 and 13.97 per 1000 nt,
while the JN177 vs Ta comparison showed 8.69 and 7.98, respectively. The SNP frequency for
genes associated with metabolic processes ranged from 5.45 to 6.29 per 1000 nt in
the JN177 vs Ta comparison and 9.16–14.25 per 1000 nt in the SR3 vs JN177 comparison (Additional file 3: Table S2). SNP frequencies were also higher in the SR3 vs Ta comparison than in the JN177 vs Ta comparison for genes encoding proteins involved in various metabolic processes
except for glycolysis as well as nucleobase, nucleoside, nucleotide and nucleic acid
metabolic process (Additional file 3: Table S2). This difference in SNP frequencies was also found in ESTs associated with protein
fate, transport, cell redox homeostasis, and response to (oxidative) stress (Additional
file 4: Table S3). With respect to unigenes varying at the level of indels, the frequency
of polymorphism was lower in the JN177 vs Ta than in the SR3 vs JN177 comparison (Table 6; Additional files 3 and 4: Table S2 and S3). Indel events were noticeably rare in genes involved in nucleosome
assembly and chromatin assembly/disassembly (Table 6). The respective frequencies were 0.43 and 0.38 per 1000 nt in the JN177 vs Ta comparison and 0.95 and 0.92 per 1000 nt in the SR3 vs JN177 comparison.

Table 6. The function of unigenes affected by SNP and indels in SR3 and JN177

Sequences flanking indels in the unigenes

The sequences flanking indels were characterized by calculating the GC content of
the ten nucleotides flanking either side of the indel. There was no obvious difference
in GC content between 5′ and 3′ terminal flanking sequence in any of the comparisons
(JN177 or SR3 vs wheat database ESTs, SR3 vs JN177) (data not shown). The trend of GC content was similar when the second to tenth
nucleotides of the flanking sequence were considered (Fig. 2). In the 3′ terminal flanking sequences, the GC content of the nucleotides at positions three, six and nine from the
indels was higher than that of the nucleotides at positions four, five, seven or eight,
but this pattern was not observed in the 5′ terminal
flanking sequences. The GC content of the flanking sequence was higher in the SR3
vs JN177 comparison (53.47–54.04 %) than in the other two comparisons (51.88–52.88 %
in JN177 vs Ta, 50.99–52.63 % in SR3 vs Ta). In the SR3 vs Ta and JN177 vs Ta comparisons, the GC content of the flanking sequence of the deletions was higher
than that of the insertions, while the GC content for all indels combined was close to that of
the insertions. However, in the SR3 vs JN177 comparison, the GC content of the flanking sequence of the deletions was generally
lower than that of the insertions, and the difference between the deletions and insertions
was weaker than that in the JN177 or SR3 vs Ta comparisons; the GC content for all indels combined was not biased toward
either the deletions or insertions. In the JN177 or SR3 vs Ta comparisons, with respect to the nucleotides lying on the 5′ side of the indels,
the GC content of the nucleotide adjacent to the deletions was significantly lower
than that of other nucleotides in the flanking sequence, but the GC content of the
nucleotide adjacent to the insertions was significantly higher than that of other
nucleotides in the flanking sequence (Fig. 2b, c). On the other hand, for the nucleotides lying on the 3′ side of the indels, the
GC content of the nucleotide adjacent to the deletions remained high and was comparable
to that of the second and third flanking nucleotides, but the GC content of the nucleotide
adjacent to the insertions was significantly lower than that of the other flanking
nucleotides (Fig. 2b, c). In contrast, in the SR3 vs JN177 comparison, for the nucleotides lying on the 5′ side of the indel, the GC content
of the nucleotide adjacent to both the deletions and insertions was higher than that
of the second to tenth nucleotides of the 5′ flanking sequence, while for the nucleotides
lying on the 3′ side of the indels, the GC content of the nucleotide adjacent to both
the deletions and insertions was lower than that of the second and third nucleotides (Fig. 2a).
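The position-wise GC content of the ten nucleotides flanking each indel can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the 0-based half-open coordinate convention and the example sequences are all hypothetical.

```python
def flanks_of(seq, start, end, width=10):
    """Return the (5' flank, 3' flank) around an indel occupying seq[start:end].

    Coordinates are 0-based and half-open (an assumption of this sketch).
    Returns None when the indel lies too close to either end of the sequence.
    """
    if start < width or end + width > len(seq):
        return None
    return seq[start - width:start], seq[end:end + width]


def gc_by_position(flanks):
    """Percentage of G or C at each position across equal-length flanks."""
    if not flanks:
        raise ValueError("no flanking sequences supplied")
    width = len(flanks[0])
    if any(len(f) != width for f in flanks):
        raise ValueError("all flanks must have the same length")
    percentages = []
    for i in range(width):
        gc = sum(1 for f in flanks if f[i].upper() in ("G", "C"))
        percentages.append(100.0 * gc / len(flanks))
    return percentages


# Hypothetical sequences with an indel at positions 10..13; NOT study data.
example = [
    ("ACGTACGTAGTTTGCCGATCGAT", 10, 13),
    ("GGCATTACGCAAACGTTAGCCGA", 10, 13),
]
pairs = [flanks_of(seq, s, e) for seq, s, e in example]
five_prime = [p[0] for p in pairs if p is not None]
three_prime = [p[1] for p in pairs if p is not None]
print(gc_by_position(five_prime))
print(gc_by_position(three_prime))
```

In the 5′ list the last element corresponds to the nucleotide adjacent to the indel (position −1 in Fig. 2), whereas in the 3′ list the first element is the adjacent nucleotide (position 1).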

Fig. 2. Variation in the GC content in the sequence immediately flanking indels. a: SR3-JN177, JN177 unigene sequences queried with those of SR3. b: SR3-Ta, SR3 unigene sequences queried with wheat ESTs housed in GenBank. c: JN177-Ta, JN177 unigene sequences queried with wheat ESTs housed in GenBank. −10 ~ −1:
The tenth to first nucleotides on the 5′ side of the indel. 1 ~ 10: The first to tenth
nucleotides on the 3′ side of the indel. In-mean, Del-mean and indel-mean: the GC
content of 5′ and 3′ flanking sequences of insertions, deletions and indels, respectively

Indel classification

We further characterized the flanking sequences of the large indels in the
SR3 vs JN177 comparison. The flanking sequences of 45 large insertions and 39 large deletions
identified in the SR3 vs JN177 comparison were identical (Fig. 3a, d). Two examples were SR3_2LCP226_E06 (474 nt insertion) and SR3_firstas1573 (34 nt
deletion) (Additional file 5: Figure S2A, B). A second group comprised 40 insertions and 36 deletions, the two termini of which
carried repeated sequences in SR3 and JN177 (Fig. 3b, e). The flanking sequence of SR3_5V50 (179 nt insertion) harbors a run of G’s; the
3′ flanking sequence of SR3_firstas843 (55 nt deletion) includes two copies of CATCCC
in JN177 but only one in SR3 (Additional file 5: Figure S2C, D). The repeat motifs present in the flanking sequence ranged in length
from 1 to 51 nt (data not shown). A 1 nt motif was present in 23 of the insertions
and 19 of the deletions, dominated by runs of G (data not shown). The third group,
in which the flanking sequence was modified (Fig. 3c, f), is exemplified by SR3_firstas1573 (141 nt deletion), where SNPs were generated
at three positions adjacent to the deletion (Additional file 5: Figure S2E). A few of the unigenes experienced multiple indel events: SR3_firstas1573
carries two deletions, one belonging to the first group and the other to the third
group (Additional file 5: Figure S2B, E). Other variants included the induction of translocated and chimeric
sequences. In the homologs SR3_2LCP192_G10 and JN177_firstas231, the same sequence
was found at positions 1187–1367 in the former, but at positions 1–181 in the latter.
SR3_2LCP192_G10 also harbors a large deletion with a repeat sequence CAAGAAGGA (Additional
file 6: Figure S3A). SR3_firstas716 nucleotides 88–196 do not align with JN177_LCP139_D11
nucleotides 157–278, but their terminal sequences are identical (Additional file 6: Figure S3B).
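For the second group of indels, the difference in repeat copy number at the indel boundary can be illustrated by counting tandem copies of a motif. The sketch below is only an illustration: the function name is hypothetical and the two flanking sequences are invented to mimic the CATCCC example (two copies in JN177, one in SR3); they are not the actual unigene sequences.

```python
def tandem_copies(seq, motif, start=0):
    """Count consecutive copies of motif in seq beginning at index start."""
    if not motif:
        raise ValueError("motif must not be empty")
    copies = 0
    pos = start
    while seq.startswith(motif, pos):
        copies += 1
        pos += len(motif)
    return copies


# Hypothetical 3' flanking sequences; NOT the real SR3/JN177 sequences.
jn177_flank = "CATCCCCATCCCGGA"
sr3_flank = "CATCCCGGA"

print(tandem_copies(jn177_flank, "CATCCC"), tandem_copies(sr3_flank, "CATCCC"))
```

The same function applies to single-nucleotide motifs such as runs of G, which dominated the 1 nt repeats noted above.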

Fig. 3. Hypothetical model for the formation of large indels induced during asymmetric somatic
hybridization. Blue block: unigenes shared by SR3 and JN177. Red block: insertion
and deletion fragments. Black block: repetitive sequences in the indel flanking sequence.
Gray blocks: small fragments adjacent to insertions and deletions, differing sequence
between SR3 and JN177. SR3-JN177: JN177 unigene sequences queried with those of SR3.
SR3-Ta: SR3 unigene sequences queried with wheat ESTs housed in GenBank. JN177-Ta:
JN177 unigene sequences queried with wheat ESTs housed in GenBank