The draft genome of Primula veris yields insights into the molecular basis of heterostyly


Genome and transcriptome assemblies

We generated a large amount of sequence data from a diverse suite of sequencing libraries
to assemble the draft P. veris genome. A full account of our sequencing efforts is shown in Table 1. Using these data, we employed a two-step strategy for de novo genome assembly in order to fully leverage the long-read data generated by PacBio
RS. Our first assembly was performed using only short-read (that is, 100 to 250 bp)
sequences generated from standard paired-end and 3 to 9 Kb mate-pair libraries on
Illumina HiSeq, MiSeq, and Ion Proton platforms. This assembly was based on a total
of 54.5 Gb of raw data and resulted in a total of 48,812 contigs that were grouped
into 9,002 unique scaffolds (Additional file 1: Table S1). The total contig length is 232.2 Mb and the total scaffold length, including
gaps, is 301.8 Mb, which represent 49% and 63%, respectively, of the estimated 479.22 Mb
P. veris genome 65]. The N50 contig size is 9.5 Kb, and the N50 scaffold size including gaps is 164 Kb,
with a median gap size of 960 bp. The largest and smallest scaffolds are 2.14 Mb and
888 bp, respectively.

Table 1. Summary of sequence data used in thede novoassembly of the draftP. verisgenome

The transcriptome sequencing of P. veris and P. vulgaris yielded more than 200 million paired reads, which were distributed quite evenly among
the six RNAseq libraries that we multiplexed and sequenced in parallel on a single
Illumina flow cell (Additional file 1: Table S2). Our de novo P. veris transcriptome assembly contains 25,409 putatively unique transcripts, and the P. vulgaris transcriptome assembly contains 24,318 unique transcripts. We also performed de novo transcriptome assemblies of the related species P. obconica, P. wilsonii, and P. poissonii using data available on the Genbank Sequence Read Archive (SRA). Our de novo transcriptome assembly of P. obconica was found to contain 22,752 transcripts, but the transcriptome assemblies of P. poissonii and P. wilsonii were considerably smaller, with 11,905 and 12,927 unique transcripts, respectively.
This reduced transcript pool likely represents the fact that the publicly available
Genbank SRA data for these species contained far fewer raw sequence reads 66] than those we produced for P. veris and P. vulgaris, and than the reads available in the SRA for P. obconica.

To improve our P. veris genome assembly, we generated 3.4 Gb of PacBio RS sequence data with an average read
length of 3,658 and an average base quality of 0.834 (Additional file 2: Figure S1). These data, which contribute an additional 8× coverage for the P. veris genome, were integrated into our assembly through the implementation of the PbJelly
software tool 67]. The PBJelly algorithm essentially anchors the long PacBio reads to the existing
contigs and scaffolds and extends them partially spanning the contiguous gaps. When
PacBio sequences are long enough to be anchored to both sides of a gap, then the gap
is fully closed. Adding the PacBio RS data to the assembly resulted in a total of
21.15% of the gaps in the previous assembly being entirely closed and 38.4% of the
ambiguous positions in the gaps being filled. This translated into a significant improvement
of both the compactness and completeness of our de novo P. veris genome. At 269.7 Mb, the new total contig length increased by 19.4%; the new total
scaffold length (including gaps) reached 310.07 Mb, a 2.7% increase over the draft
assembly based on the Illumina and Ion Proton data alone. In terms of genome completeness,
these contig-length numbers represent 56.3% and 64.7%, respectively, of the P. veris genome. The updated draft is also more compact: the number of contigs was reduced,
by 16.88%, to 40,569 contigs, grouped into 8,764 unique scaffolds (2.65% fewer scaffolds,
see Additional file 1: Table S1). At 13.3 Kb, the N50 contig size grew by 40%, whereas the N50 scaffold
size including gaps reached 165.9 Kb, with a median gap size of 664 bp. The largest
and smallest contigs are 233,407 bp and 19 bp, respectively. The sizes of the largest
and smallest scaffolds remained 2.14 Mb and 888 bp, respectively; this is not surprising,
since gap-closing tends to bridge scaffolds of low to medium size. The assembly scaffolds
on average contain 29 ambiguities per 10 kb. The GC content of the scaffolds, excluding
gaps, is 33%, which is similar to the genome sequences of tomato (34%; 68]) and kiwifruit (35.2%; 69]).

Although the success of a draft genome assembly is strongly dependent on the genetic
complexity of the specific organism and its genome size, we find the overall quality
of the draft P. veris assembly, based on the aforementioned measures, to be consistent with the majority
of de novo genome assembly efforts reported in the last few years. Unsurprisingly, genome assemblies
generated from highly homozygous samples, such as the PN40024 grapevine line 70], or those employing a larger number of ad hoc mate pair libraries (for example, domesticated apple Malus?×?domestica71]; kiwifruit Actinidia chinensis69]; mulberry Morus notabilis72]) resulted in higher quality drafts exhibiting a three- to four-fold decrease in the
number of scaffolds and a two- to four-fold increase in the corresponding N50, relative
to our assembly. However, when compared to assemblies in which similar resources were
used, our assembly is quite compact. For example, our P. veris assembly is between two and 10 times more compact than the draft genomes of Cannabis sativa73] and European pear (Pyrus communis ‘Bartlett’ 74]), which contain 136,290 and 142,083 scaffolds, respectively, and scaffold N50s smaller
than 90 Kb. These genome projects employed two mate pair libraries of sizes similar
to our study, but it is important to point out that their minimum contig size was
less than 500 bp, while our assembly is based on a minimum contig size of 1,000 bp.
Our assembly is one of the first to specifically employ PacBio sequence data to fill
scaffold gaps in a heterozygous plant genome, and this aspect, together with the use
of a third, larger mate pair library, may be responsible for the improved compactness
(that is, fewer contigs) of our assembly.

Genome annotation and quality assessment

Of the 8,764 scaffolds that make up our genome assembly, 2,495 (28.5%) were found
to contain annotated genes. Our final annotation of the P. veris genome contained a total of 19,507 predicted genes, less than the number of genes
found in the relatively well-annotated genomes of Arabidopsis thaliana (27,029 genes; 75]), sacred lotus (Nelumbo nucifera, 26,685 genes; 76]), and mulberry (29,338 genes; 72]). Furthermore, we find about 7,000 fewer genes predicted in the genome annotation
compared with the de novo P. veris transcriptome assembly (25,409 predicted genes; see above), but most likely the real
number of genes is somewhere in between those two estimates, and a combination of
factors might have contributed to underestimating the predictions in the genome annotation
and inflating the number of genes predicted in the de novo transcriptome assembly. First, transcripts predicted by the de novo transcriptome assembly are based on only one source of evidence, that is, the RNA-seq
data, and represent only one source of evidence in the Maker2 genome annotation process.
For a transcript to be reported by Maker2, at least two sources of evidence are required,
hence those transcripts that are not strongly supported by a second source of evidence,
such as ab initio gene prediction algorithms or protein homology are not reported in the final genome
annotation. Additionally, a de novo transcriptome assembly alone is often not able to completely discriminate between
multiple isoforms of a specific gene, which may slightly inflate the estimated number
of genes based on these data alone. The predicted gene content of P. veris is significantly reduced compared to kiwifruit (39,040 genes 69]) and European pear (43,413 genes 74]), probably owing to the relatively recent whole genome duplications in the latter
two species. The Maker2 pipeline identified a total of 279,271 repetitive elements
representing approximately 7% of the P. veris draft genome assembly (Additional file 1: Table S3).

To assess the completeness of the final genome assembly, we searched for the presence
of 248 conserved core eukaryotic genes using the CEGMA software package (CEGs; 77]). We found confident hits to 198 (79.84%) full length (that is, 70% alignment) CEG
proteins and 234 partial (94.35%) CEG proteins, where partial matches are defined
by the CEGMA default pre-computed minimum alignment score for each CEG 77]. These results suggest an exceptional level of completeness for this first draft
genome sequence of a heterozygous, non-model plant. Annotated scaffolds contained
a median number of four genes, and the largest scaffold (Contig0; 2.15 Mb) was found
to contain 352 predicted genes, and generally, longer scaffolds tend to contain more
genes. The quality of the gene prediction and annotation was evaluated in a number
of ways. Searching the predicted gene set with the InterProScan tool, we found that
15,659 (85.6%) of predicted proteins contain PFAM domains. One commonly applied quality
metric of genome annotations is the cumulative annotation edit distance (AED), which
represents the agreement between predicted gene models and external evidence. A gene
model with an AED score of zero indicates complete agreement between the predicted
gene model and, for example, a transcript from the de novo P. veris transcriptome assembly, which we incorporate as external evidence in the Maker2 pipeline
78]. As the ab initio gene predictors are trained through iterative Maker2 runs, the cumulative fraction
of predicted gene models with low AED scores increases, thus indicating more agreement
between gene models and external evidence (Additional file 3: Figure S2). In our final annotation, approximately 80% of gene models have an AED
score of 0.2 or less, indicative of a high degree of agreement between predicted gene
models and external evidence.

Comparative transcriptome analyses

We used the OrthoMCL pipeline 79],80] to identify sets of putatively orthologous loci at two phylogenetic scales: (1) among
transcriptomes of five species of Primula; and (2) among P. veris and four Euasteridae species with published genome assemblies (Figure 1). Our comparative Primula transcriptome analyses confidently identified a total of 4,207 orthologs among the
transcriptomes of P. veris, P. vulgaris, P. obconica, P. wilsonii, and P. poissonii, with 2,391 of these likely representing single-copy genes in all of the species
(Figure 1A; see Additional file 4). The most recent phylogenetic hypotheses for the genus Primula suggest that P. veris, P. vulgaris31] and P. poissonii, P. wilsonii, respectively, belong to two different sister clades, 66], with P. obconica included in a third clade basal to the other two clades 13],14]. Our results are consistent with these hypothesized phylogenetic relationships, for
we find more shared orthologs between P. vulgaris and P. veris than in all other pairwise comparisons (3,205), and a similar number of putatively
orthologous genes limited to P. veris, P. vulgaris, and P. obconica (3,211). However, we do not see a similar number of shared orthologs unique to the
sister species P. poissonii and P. wilsonii (487), but this result might be explained by the fact that these transcriptomes were
assembled with far less sequence data 66] than our de novo transcriptome assemblies, and thus they are likely to represent a smaller proportion
of the expressed genes. Our comparative analyses of euasterid transcriptomes identified
a total of 8,079 putatively orthologous genes, and of these approximately 2,402 are
likely to be single copy in all examined species (Figure 1B; see Additional file 5). These orthologous gene sets provide a framework for the development of future phylogenomic
studies aimed at resolving species-level relationships within the genus Primula, or deeper phylogenetic relationships within the Euasteridae.

Figure 1. Venn diagrams of orthologous gene clusters. The number of putatively unique transcripts in each transcriptome is shown below
the taxon name. The value in parentheses represents the number of orthologous genes
that are most likely to be single-copy in all of the five transcriptomes sampled.
(A) Orthologous groups identified in comparative transcriptome analysis of five species
of Primula. (B) Orthologous groups identified in comparative transcriptome analysis of five Euasterid
species with draft genome assemblies. See also Additional files 4 and 5.

Gene expression differences between floral morphs and functional analysis of candidate
genes

To search for genes that might underlie the phenotypic differences between the floral
morphs in Primula, we compared gene expression between flowers of L- and S-morph plants in both P. vulgaris and P. veris. Given that we sampled RNA from floral bud tissues 3 to 5 days prior to anthesis,
our results are primarily relevant to the later stage of floral development. At this
stage, cell growth in the style and in the corolla tube below the point of stamen
attachment differs between L- and S-morph flowers. Differential cell growth is thought
to be primarily responsible for differences in style length and the relative position
of the anthers in the middle or the top of the corolla tube, respectively 48]. We adopted a Benjamini-Hochberg false discovery rate (FDR) at 5% level to correct
for multiple testing. At an FDR-adjusted P value threshold of 0.05, we found 620 and 677 genes in P. veris and P. vulgaris, respectively, that exhibited significant differential expression between morphs.
To reduce the number of false positives in this gene set, we retained only the 113
genes that exhibit morph-specific differential expression in both P. veris and P. vulgaris floral buds (Additional file 1: Table S4). Of these 113 genes, the majority (that is, 73 genes) showed increased
expression in S-morph versus L-morph floral buds. Because the distylous phenotype
is limited to floral tissues, it is likely that genes involved in determining the
floral morph would not be differentially expressed in L- and S-morph leaf tissues.
We find that, of these 113 genes showing floral morph-specific differential expression,
69 are not differentially expressed in P. veris L- and S-morph leaves (Additional file 1: Table S4).

We examined the candidate set of 113 differentially expressed genes to determine the
types of functional gene classes represented. Functional annotations for 92 of the
113 genes were clustered using the DAVID bioinformatics resources, while the remaining
21 genes contained no conserved protein domains and are not readily assignable to
functional classes. Annotation clusters represent groups of functionally similar GO
terms associated with the gene list, thus providing a more direct biological interpretation
of related terms for large gene lists 81]. As seen in Figure 2, GO terms attributed to the 92 genes showing morph-specific differential expression
can be grouped into 17 annotation clusters. More than 40% of the candidate genes involved
in differential floral morph expression contain GO terms attributable to annotation
cluster 1, which is characterized by the terms ‘extracellular’, ‘secreted’, and ‘hydrolase’,
among others (Additional file 1: Table S5). These terms could be related to a wide array of biological processes,
but it is plausible that some of the genes might control mechanisms of heteromorphic
self-incompatibility typical of distylous Primula species. Self- and intra-morph pollen tube inhibition in Primula is thought to result from biochemical interactions between the pollen grain or pollen
tube and tissues of the stigma and/or style 82],83]. Such biochemical interactions are likely to occur in the extracellular matrix or
at the stigmatic surface, thus requiring the active secretion of molecules involved
in the heteromorphic self-incompatibility mechanism 52],84]-87].

Figure 2. Distribution of genes differentially expressed in L- and S-morph plants assigned to
functional annotation clusters.
Annotation clusters are labeled with a single GO term that is representative of all
of the terms defining cluster membership. See Additional file 1: Table S5 for a complete listing of GO terms associated with each annotation cluster.

The remaining 16 annotation clusters contain GO terms associated with 25% to approximately
3% of the candidate genes; among these are a few clusters that stand out as being
potentially relevant to the phenotypic differences associated with the two floral
morphs. Specifically, annotation clusters 2 and 7 are characterized by GO terms associated
with cell-wall organization and cell growth, respectively. These clusters are particularly
interesting in the context of morph-specific development, because many of the phenotypic
differences that define the L- and S-morph flowers involve differential cell proliferation
and elongation in the corolla tube (the A gene) and the style (the G gene, 12],48],88]). Annotation cluster 15 is characterized by the terms ‘post-embryonic development’,
‘reproductive structure development’, and ‘reproductive developmental process’ (Additional
file 1: Table S5). Floral development genes frequently act as transcription factors regulating
the expression of a diverse suite of genes that influence the development of specific
floral structures 89]. It is thus possible that the genes associated with the GO terms of annotation cluster
15 could be coordinating the developmental changes associated with differential cell
elongation observed in corolla tubes and styles of L- and S-morph plants.

GLO1 GLO2: Duplicated GLOBOSA homologs in Primula

One of the most intriguing results of our differential expression analysis in P. veris and P. vulgaris is the fact that BG8816696, the gene with the most significant morph-specific differential
expression (as measured by FDR-adjusted P value; Additional file 1: Table S4), is strikingly similar to the B-function MADS-box gene GLOBOSA (E-value?=?0), which is a transcription factor integral in the normal development of
petals and stamens in Antirrhinum majus90]. Blast searches reveal that this gene had been previously sequenced from P. vulgaris by Li et al.60] and from P. denticulata by Viaene et al.91]. While Li et al.60] considered this gene to be single copy, Viaene et al.91] realized that there were actually two GLOBOSA homologs in P. denticulata. When we map the GLOBOSA homologs from P. vulgaris (PvGLOP1, PvGLOP2, and PvGLOT) and P. denticulata (PdGLO1 and PdGLO2) to our genome assembly, we find unequivocal support for two distinct GLOBOSA genes in Primula. Specifically, we find that Li et al.’s 60] two putative L-morph alleles (named PvGLOP1 and PvGLOP2 by the authors) and PdGLO191] map to an entirely different scaffold than the putative S-morph allele (PvGLOT60]) and PdGLO291]. We hereby refer to these two genes as PveGLO1 (BG8816827), which is homologous to PdGLO1 and the two alleles PvGLOP1 and PvGLOP260], and PveGLO2 (BG8816696), which is homologous to PdGLO2 and PvGLOT (Figure 3). Viaene et al.91] failed to find evidence for two GLOBOSA homologs in a monomorphic species of Primulaceae (Cyclamen persicum), and thus it is possible that the duplicated GLOBOSA could be specific to the genus Primula. By comparing the mRNA sequences of PvGLOT60], PdGLO291], and PveGLO2 assembled from S-morph RNAseq data with the scaffold sequence carrying PveGLO2 (Contig1404), we find that the last exon and 3’ UTR appear inverted in the genome
assembly. This inversion in the scaffold sequence is likely the product of an assembly
error, because the transcriptome assembly contains a complete PveGLO2 gene and there is a large (approximately 10 Kb) assembly gap within the last intron
of the PveGLO2 gene model. PveGLO2 is completely silenced in L-morph flowers of P. veris and P. vulgaris, which – in light of Li et al.’s 60] finding of linkage between PveGLO2 (PveGLOT in their terminology) and the S-morph haplotype of the S-locus in P. vulgaris – could reflect the absence of this gene from the L-morph haplotype. Thus, while
we currently do not have evidence for PveGLO2 being linked to the S-locus in P. veris (see below), this duplicated B-function MADS-box gene is an attractive candidate
for future studies of distylous flower development in Primula.

Figure 3. Sequence alignment and tree showing the relationship ofGLOBOSA/PISTILLATAgenes in Primulaceae. (A) Amino acid positions that are shaded are identical to PveGLO2. (B) The tree was rooted with the PI gene sequence from the kiwifruit draft genome (A. chinensis, 69]). Branch support values were calculated based on 1,000 neighbor-joining bootstrap
replicates. Our labelling of GLO1 and GLO2 follows the convention established by Viaene et al.91]. See text for details regarding gene names.

Bulk Segregant RAD-Seq identifies SNPs linked to the P. veris S-locus

To enable testing of the above hypotheses about properties of S-locus linked sequences, we aimed to experimentally identify corresponding scaffolds.
To this end, we searched for sequence polymorphisms tightly linked to the S-locus in P. veris, using a RAD-seq bulk segregant analysis on pooled L-morph and S-morph DNA. Any polymorphisms
that are closely linked to the S-locus should be largely homozygous in the L-morph pool, but heterozygous in the S-morph
pool. Using a stringent set of criteria (coverage in both pools 60×, allele frequency
in L-morph pool 1.0, allele frequency in S-morph pool 0.3-0.7), this analysis identified
24 SNPs as potentially tightly linked to the S locus (see Additional file 6).

To enable high-throughput genotyping, the SNPs identified above were converted to
PCR-based markers, either cleaved-amplified polymorphic sequences (CAPS) or derived
CAPS (dCAPS). These markers were initially tested on pooled L- or S-morph DNA. Out
of the 24 markers, 13 allowed for robust amplification and detected the predicted
polymorphisms between L- and S-morph pools. These markers were next used to genotype
48 individuals from a natural P. veris population in Potsdam, Germany, and based on these data, six markers were found to
be linked to the S-locus, while the remaining seven showed no clear correlation between marker genotype
and S-locus phenotype (that is, floral morph; Figure 4). Inspection of the genotypes in Figure 4A indicates that markers 9274, 37812, 41358, 51955, 59102, and 101982 are completely
linked to the S-locus when considering only the L-morph plants. The genotypes of S-morph plants suggest
that in the population there are in fact two common chromosomes harboring the dominant
S-allele (S-morph). The first is found in phenotypically S-morph plants that are heterozygous
for all six markers in question and thus carries S-morph specific alleles at these
markers; this chromosome will be termed the ‘original S-morph chromosome’ below. The
other common chromosome is found in phenotypically S-morph plants homozygous for the
L-morph allele at markers 9274, 37812, 41358, and 101982. Hence, a recombination event
appears to have combined the dominant S allele with the recessive s (L-morph) alleles at these four markers, and the resulting ‘recombinant S-morph chromosome’
appears to have spread in the local population. No recombination was found among the
four markers 9274, 37812, 41358, and 101982. Thus, a tentative, relative marker order
deduced from these 48 samples is: (S-locus/51955/59102) – (37812/101982/9274/41358), where the markers/loci in brackets
cannot be separated from each other. From these six markers we chose four (51955,
59102, 37812, 101982) and genotyped 91 additional individuals (Figure 4B). No further recombinant genotypes beyond the ones found in the first 48 plants
were detected, thus confirming the deduced order of loci. By blasting the six loci
to the P. veris genome assembly, we find that markers 37812 and 101982 are both located on Contig437,
a result consistent with their complete linkage in the mapping results. Notably, we
also find that markers 9274 and 59102 are separated by approximately 70 Kb on Contig273.
Our mapping results suggest that the recombination event that may have given rise
to the ‘recombinant S-morph chromosome’ could have occurred between 9274 and 59102
on Contig273, and this region can thus be considered a functional boundary of the
S-locus. We also note that the absence of any detectable recombination between four
markers (37812, 101982, 9274, and 41358) located on three different scaffolds of considerable
size outside of the functional S-locus indicates that the suppression of recombination extends over large distances
beyond the S-locus (see Table 2).

Figure 4. Identification ofS-locus linked SNPs. Graphical representation of genotypes of (A) 48 P. veris plants genotyped for 13 PCR-based markers derived from the bulk segregant RAD sequencing
analysis and three SNPs in SLL1, and (B) 91 P. veris plants genotyped for four PCR-based SNP markers and three SNPs in the SLL1 gene. Phenotype: yellow?=?L-morph, green?=?S-morph. Genotypes: yellow?=?homozygous
for allele 1, green?=?heterozygous, blue?=?homozygous for allele 2.

Table 2. Genome scaffolds putatively linked to theP. veris S-locus

To compare the resolution afforded by our mapping in the natural P. veris population to previous reports, we isolated the P. veris locus orthologous to the previously described PvSLL1 locus from P. vulgaris49]. Comparing the sequences of the P. veris and the P. vulgaris SLL1 locus indicated that one of the internal repeats found in P. vulgaris is replaced by an unrelated sequence in P. veris (Additional file 7: Figure S3). Sequencing this locus from pools of L- and S-morph individuals identified
three polymorphic SNPs that were converted to PCR-based genotyping markers. Testing
the 139 samples above for these three SNPs indicated no association of any of the
SNPs with the S-locus genotype, nor an association of any of the resulting seven haplotypes present
in the sample with the S-locus genotype (Figure 4). Although there was still significant linkage disequilibrium between the three SNPs
(Additional file 8: Figure S4; ?2-test; P 0.0001), we identified a substantial proportion of recombinant genotypes, even though
the three SNPs are all found within less than 200 bp. This suggests that the studied
population captures a very large number of historical recombination events resulting
in an absence of LD between PvSLL1 and the S locus. Conversely, it indicates that the very tightly linked markers identified above
are likely to reside within or immediately next to the S-locus.

Genome-wide estimates of heterozygosity and polymorphism

Heterozygosity and polymorphism were explored in the P. veris genome by resequencing a single P. veris individual from a Latvian population and mapping these raw sequence reads to our
genome assembly. We found a total of 1,184,748 SNPs (0.51%) composed of 982,590 heterozygous
sites and 202,158 homozygous sites compared to the reference assembly, and a total
of 96,113 indels were identified in the Latvian P. veris individual (see Additional file 9).

Theory predicts that the S-locus will be largely shielded from recombination 92], and thus S-linked genome scaffolds are expected to show elevated rates of heterozygosity in
S-morph plants, which are heterozygous at the S-locus (Ss), compared to L-morph plants, which are homozygous at the S-locus (ss). We evaluated this prediction by mapping RNAseq reads from L- and S-morph
plants of P. veris to our genome assembly to compare the number of SNPs predicted in coding sequences
of genome scaffolds that are putatively linked to the S-locus. We found a total of 45,018 SNPs when mapping the L-morph RNA to the genome
assembly, and 51,191 SNPs when mapping the S-morph RNA to the genome assembly, and
the average S-morph/L-morph SNP ratio for all genome scaffolds is 2.54 (variance?=?23.78).
Table 2 shows the results of this analysis for specific genome scaffolds carrying genes that
have been previously identified as linked to the S-locus in P. vulgaris: PvSLL1, PvSLL249], PveGLO1, and PveGLO260]. Additionally, we include the genome scaffolds carrying the six loci that are linked
to the S-locus based on our RAD-seq bulk segregant analysis and fine mapping experiments (see
above): RAD loci 41,358, 37,812/101,982, 51,955, and 9,274/59,102. Our results show
that, on average, there are more SNPs mapped from S-morph RNA reads than L-morph RNA
reads on putatively S-linked scaffolds (Table 2). By utilizing RNAseq reads rather than whole genome sequence data, our analysis
is limited to polymorphisms evident in coding regions alone. Additionally, if certain
genes experience high degrees of allele-specific expression, we will fail to observe
polymorphisms solely due to an absence of mapped reads. It is important to note here
that Contig1404, which carries just one gene (PveGLO2), contains no polymorphism when mapping either L- or S-morph RNAseq reads: this is
the expected result if PveGLO2 were S-linked and only expressed in the S-locus of the S-morph. The expectation of increased heterozygosity is fundamentally
based on the assumption that individuals with an L-morph phenotype will also be highly
homozygous at the S-locus. But if recombination is suppressed between L-morph alleles as well, then one
might actually expect that linked polymorphisms would accumulate between different
L-morph haplotypes, thus leading to elevated heterozygosity at the S-locus in L-morph as well as S-morph plants. Addressing this question is beyond the
scope of the current study, but future analyses aimed at more precisely characterizing
genomic patterns of heterozygosity in and around the S-locus will benefit greatly from our draft genome assembly of P. veris.