Comparative in silico analysis of SSRs in coding regions of high confidence predicted genes in Norway spruce (Picea abies) and Loblolly pine (Pinus taeda)

This is the first SSR analysis in gymnosperm species to consider the high-confidence,
full-length coding regions of genes; all earlier studies involving gymnosperms were
carried out on ESTs. The methodology applied previously also differs from ours
(reviewed in [14]): for example, some studies considered the 5′ UTR, ORF and 3′ UTR separately [14], while others considered only 5′ ESTs and 3′ ESTs [15]. In the current study we have also analysed class I and class II SSRs separately.
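
Because the outcome depends on the detection parameters, a minimal sketch of a perfect-repeat SSR search with a class I / class II split may be useful. The thresholds here (a minimum tract length of 12 bp, with class I defined as tracts of 20 bp or longer and class II as shorter tracts) are assumptions based on a convention used in parts of the SSR literature, not parameters stated in this section, and the function name `find_ssrs` is hypothetical.

```python
import re

def find_ssrs(seq, min_len=12, class_i_min=20):
    """Return (start, motif, repeats, class) for perfect SSRs with motifs of 2-6 bp.

    min_len and class_i_min are ASSUMED thresholds, not values from the paper.
    Overlaps between hits of different motif lengths are not resolved.
    """
    seq = seq.upper()
    hits = []
    for k in range(2, 7):
        min_repeats = -(-min_len // k)  # ceiling division
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "AAGAAG"), so each tract is reported once.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            length = m.end() - m.start()
            ssr_class = "I" if length >= class_i_min else "II"
            hits.append((m.start(), motif, length // k, ssr_class))
    return hits
```

For example, under these assumed thresholds `find_ssrs("CC" + "AAG" * 7 + "CC")` reports one class I trimer tract, while the same sequence with only four AAG repeats yields a class II tract.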

Overall abundance of SSRs in Picea abies

Counts per Mbp of SSR motifs were higher in Picea abies (Table 1), which is in partial agreement with earlier investigations [14], [15], [19], considering that in the current study the difference
between the two species was significant only for class II SSRs. The motif length detected
in the current study (class I SSRs) was lower than in earlier studies in
both genera [14], [18], but it is noteworthy that the standard error reported in the current study is also
very low. In Picea abies, the overall abundance of SSR loci in class I is primarily the result of a higher
frequency of trimers, which is three times higher than in Pinus taeda (the count per Mbp of hexamers is similar in both species; Table 2), whereas the higher frequency of class II SSRs in Picea abies is largely the result of an additive effect of trimers and hexamers. This again
disagrees with an earlier study in which the count per Mbp of trimers was similar in
both species whereas the count per Mbp of hexamers was higher in Pinus taeda [19].
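
The comparisons above rest on counts normalised per Mbp of sequence analysed. A minimal sketch of that normalisation follows; the function name is hypothetical and the numbers in the usage note are illustrative only, not values from Table 1 or Table 2.

```python
def counts_per_mbp(ssr_count: int, total_bp: int) -> float:
    """Normalise an SSR count by the amount of sequence searched, in Mbp."""
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return ssr_count / (total_bp / 1_000_000)
```

With illustrative inputs, 50 SSRs found in 2,000,000 bp of coding sequence correspond to `counts_per_mbp(50, 2_000_000)`, i.e. 25 per Mbp. Normalising in this way is what allows datasets of different total length to be compared between species.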

Frequency of dimer motifs

Dimers were not detected among the class I SSRs in Picea abies and, although they were detected among the class II SSRs, they were not the most abundant
type as found previously [14], [15], [18]. In a broader view, dimers are more frequent in lower plant species (algae and mosses),
while trimer motifs are more frequent in the majority of higher plant groups (flowering
plants) [18]. With reference to Picea abies, a higher abundance of dimers was detected in EST-SSRs, but the majority of those studies
were conducted on Picea spp. [15], [19], [24]. The only study conducted on Picea abies detected trimers as the most abundant repeat (trimers > pentamers > hexamers) [16]. Therefore, either the trimer frequency is species specific or the analysis depends
on the data source involved and the parameters used for the detection of SSR repeats.
For Pinus, on the other hand, trimers were the most frequently detected repeats in Pinus spp. [25], while the majority of the studies involving Pinus taeda [15], [18], [19], except one [17], showed dimers as the most abundant repeats. In our study, dimers were the
most abundant motifs after hexamers and trimers in class I SSRs, while they were the
least detected category of SSR repeats in class II (Table 2). Overall, trimers together with dimers were the most abundant motifs in most of
the studies in both species [15], [17], [19], [24]. It was reported previously that, although a higher abundance of dimers was detected
in EST-SSRs, the proportion of dimers to trimers decreased significantly in the ORF
fraction in the majority of genera, including both angiosperm and gymnosperm species
[14]. Sequence data are updated continuously with recent advancements and, as
explained earlier, the use of a different sequence dataset for the SSR analysis is
the most likely reason for not finding dimers as the most abundant motifs in either
species.

Trimers and hexamers are the most abundant motif types

Genome-wide studies of SSR distribution in eukaryotes reveal an
abundance of trimers and hexamers in coding regions, in lower single-celled organisms
such as yeast [31], in higher organisms such as the model plant Arabidopsis [32], [33], and in more complex organisms such as humans [34]. Trimers and hexamers predominate because they are favoured by selective pressure
over the other repeats (e.g. dimers, tetramers and pentamers): a change in their repeat number
does not shift the reading frame, whereas an indel in the other repeat types causes a frameshift unless its length
is divisible by three, e.g. for dimers, an addition of three repeat motifs (e.g.
ATATAT) will not modify the reading frame [35].
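
The reading-frame argument reduces to a divisibility check on the indel length. The sketch below illustrates it; the function name and arguments are hypothetical and introduced only for this example.

```python
def preserves_reading_frame(motif: str, repeat_change: int) -> bool:
    """True if gaining or losing `repeat_change` copies of `motif` keeps the frame.

    The indel length is motif length times the number of copies gained or lost;
    the frame is preserved only when that length is a multiple of three.
    """
    indel_length = len(motif) * abs(repeat_change)
    return indel_length % 3 == 0
```

Any change in the copy number of a trimer or hexamer passes this check, whereas a dimer such as AT passes only when copies are gained or lost in multiples of three (e.g. ATATAT).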

In Picea abies, trimers were the most frequent motifs detected in the class I category and hexamers
ranked next, while in Pinus taeda trimers and hexamers were equally abundant (Table 2). It is noteworthy that in Picea abies the ratio of trimers to hexamers in this class is 3.1. The higher and lower
ratios of trimers to hexamers in Picea and Pinus taeda, respectively, are similar to what was reported by Berube et al. [15], but contrast with the recent comparative study in which the ratio
was lower than in the present study for Picea spp. (1.5) and slightly higher for Pinus (1.3) [14]. Hexamers were the most abundant among the class II SSR types in both species and
their count per Mbp was very high compared to the other motif types. Predominance
of trimers in Picea abies [16] and Pinus taeda [17] was reported earlier in only two studies; likewise, Yan et al. [25] demonstrated a higher frequency of trimers in Pinus spp. The abundance of hexamers in gymnosperms is in accordance with earlier results in
Picea [15], [16], Pinus [15] and Cryptomeria [36], as well as with comparative studies, which report hexamers to be more common among
EST-SSRs in gymnosperms than in angiosperms [14], [18]. Hexamer repeats were, however, underestimated in earlier studies
[14], [15] as a consequence of analysing only class I SSRs, whereas the current analysis reveals
a very high abundance of hexamer repeats when class II SSRs are also taken
into consideration (1100 and 971 per Mbp in spruce and pine, respectively).

Similar to previous investigations, AAT/ATT was one of the most frequent class
I trimers in Pinus taeda [19] (Table 3). AAG/CTT was also among the most abundant trimers; it was reported as the
most frequent trimer in other studies in Pinus [17], [25], closely followed by ACG/CGT and AGG/CCT [17]. AGG/CCT and ACG/CGT were the most frequent trimer motifs within the class I category
in Picea abies, which is similar to our previous results in the ORF fractions of Picea [14]. ACG/CGT was also the most abundant trimer detected by Berube et al. [15] in Picea and Pinus taeda. The AAG/CTT motif was among the most abundant trimer repeats in class II SSRs of both
species and in class I SSRs of Pinus taeda; it was previously reported to be the second most frequent in Pinus and the third most frequent in Picea within the class I trimers [14]. It is noteworthy that AGG/CCT and ACG/CGT were detected in both class
I and class II as the most abundant, and equally abundant, trimer motifs in both species.
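
Motif labels such as AAT/ATT pair a motif with its reverse complement. The sketch below shows one way such labels can be derived; it additionally assumes that circular shifts of a motif (e.g. AAG, AGA and GAA) are pooled, which is common practice but is not stated in this section. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(motif: str) -> str:
    """Reverse complement of a DNA motif (A, C, G, T only)."""
    return motif.upper().translate(_COMPLEMENT)[::-1]

def smallest_rotation(motif: str) -> str:
    """Alphabetically smallest circular shift of the motif."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))

def motif_pair_label(motif: str) -> str:
    """Label pooling a motif with its rotations and its reverse complement.

    ASSUMPTION: rotations are treated as the same motif class.
    """
    forward = smallest_rotation(motif.upper())
    reverse = smallest_rotation(reverse_complement(motif))
    first, second = sorted((forward, reverse))
    return first if first == second else f"{first}/{second}"
```

Under these assumptions, TTA maps to the label AAT/ATT and GAC maps to ACG/CGT, so repeats read from either strand or in any register are counted in the same class.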

Frequency of AT-rich and GC-rich motifs

An abundance of AT-rich motifs was detected in class II SSRs in both species, which is
in agreement with earlier studies in conifers [14], [15] (Table 5). AT-rich and GC-rich motifs were equally frequent in class I SSRs of Pinus taeda, while class I SSRs in Picea abies showed a higher abundance of GC-rich motifs, in contrast to earlier reports [14], [15]. This could be attributed to the difference in the data source considered, as the
method used for detection of SSRs was similar to that of our previous study [14]. AT-rich segments in the coding region regulate DNA replication [37], while GC-rich elements in the coding region play an important role in gene regulation
[38].
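
A simple way to sort motifs into AT-rich and GC-rich groups is by majority base composition, as sketched below. The majority rule and the handling of ties as "balanced" are assumptions made for illustration; the exact criterion behind Table 5 is not given in this section, and the function name is hypothetical.

```python
def classify_motif(motif: str) -> str:
    """Classify a motif as 'AT-rich', 'GC-rich' or 'balanced'.

    ASSUMPTION: a motif is AT-rich (GC-rich) when A/T (G/C) bases form the
    majority; equal counts are reported as 'balanced'.
    """
    motif = motif.upper()
    at = motif.count("A") + motif.count("T")
    gc = motif.count("G") + motif.count("C")
    if at > gc:
        return "AT-rich"
    if gc > at:
        return "GC-rich"
    return "balanced"
```

With this rule, AAT and AAG are AT-rich while ACG and AGG are GC-rich; trimers can never be balanced, whereas dimers, tetramers and hexamers can.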

GO annotation

Among genes containing class I SSRs in both species, the GO distributions show that the
highest numbers of genes belong to metabolic process, cell and binding, respectively,
for the three main GO categories (Fig. 1). Similar results were reported in Physcomitrella patens and Arabidopsis thaliana [18]. However, the GO term with the highest number of genes containing SSR loci in Cryptomeria
[36] was cellular process rather than metabolic process, which ranks first in Pinus taeda and Picea abies. Therefore, we suggest that the GO distribution may be species specific rather than
general to gymnosperms as such.

Among class I SSR loci, glutamine (Glu) is the most represented amino acid in both
conifer species studied (Fig. 2). In contrast, serine (Ser) was found to be the most frequent in Gnetum while arginine (Arg) was the most frequent in Pinus taeda [18]. In class II, Ser is the most frequent amino acid followed by Arg and leucine (Leu)
in Picea abies, while Leu ranks first, followed by Ser and Arg in Pinus taeda. It is worth noticing that tyrosine (Tyr) ranks last in all cases. In this context,
Glu and Ser repeats are amongst the few single amino acid repeats which are incorporated
into many proteins to a considerable extent [39] and polyserine repeats are the most abundant in Arabidopsis [40].