Analysis of genetic differentiation and genomic variation to reveal potential regions of importance during maize improvement


Ascertainment bias

The average correlation coefficients of the first five principal components (PCs)
between one given subset and the entire set with all markers are shown in Additional
file 1: Figure S1. The correlation coefficients between the subset and the entire set sharply
increased from 0.65 for a marker number of 500 to 0.97 for a marker number of 700.
A second sharp increase emerged when the marker number increased from 800 to 1000,
with a corresponding increase in the correlation coefficient from 0.97 to 0.99. Furthermore,
the correlation coefficient did not significantly change when the marker number increased
from 2000 to 43,252. The results indicated that 1,000 SNPs might be sufficient for
population structure analyses.

Model-based population structure

The subpopulations of 1857 accessions based on the admixture model-based algorithm
were analyzed in depth using the even distribution of 5000 SNPs. The results are depicted
in Fig. 1. The delta K (?K) peak was maximized when k?=?2 (Fig. 1a), indicating that the accessions could be categorized into two groups: tropical/subtropical
(TS) germplasm and temperate germplasm (Fig. 1b k?=?2). A second peak of ?K emerged at k?=?4 (Fig. 1a), indicating that this panel could be further divided into four subgroups: SS, NSS,
Modified Introduction in China (MICN), and TS I (Fig. 1b k?=?4). Notably, MICN formed during the long history of maize breeding in China because
Chinese maize breeders have developed a number of inbred lines derived from Chinese
landraces and U.S. hybrids. These varieties significantly differ from U.S. inbreds
19]. A third peak of ?K was observed at k?=?7 (Fig. 1a), indicating that this panel could be comprehensively categorized into seven subpopulations,
each including one of the following representative lines: B73, Huangzaosi (HZS), 207,
Oh43, Mo17, Shen137, and some from TS regions (Fig. 1b, k?=?7). Detailed information for each accession is listed in Additional file 2: Table S1.

Fig. 1. Model-based subdivision of population structure. ‘a’ presents the estimation of the Ln (probability of data). Delta K was calculated
from K?=?2 to K?=?9. ‘b’ presents the population structure of the 1,857 maize accessions deduced by membership
coefficients (Q values). Each horizontal bar presents one accession, which is consisted
of K colored segments. ‘SS’ is the abbreviation of Stiff Stalk Synthetic group, “MICN”
Modified Introduction of China, ‘TS’ Tropical/Subtropical group, and NSS Non-Stiff
Stalk

Clustering analysis

A neighbor-joining tree was constructed based on the modified Euclidean distance and
is shown in Fig. 2. The 1857 accessions were clustered into two major groups according to their origins:
the TS and Tem-tropic subpopulation. The TS subpopulation contained 525 accessions,
including 195 accessions from Mexico, 187 from the U.S., 77 from China, 17 from Sudan,
10 from Thailand, 9 from Canada, 9 from Tanzania, 6 from Nigeria, 3 from Somalia,
3 from Benin, 3 from Zambia, 3 from Chad, 2 from Spain, 1 from Ghana, 1 from Germany,
1 from Yugoslavia, and 1 from Egypt (Additional file 2: Table S1). The Tem-tropic subpopulation contained 1,332 accessions, which could
be further clustered into four subpopulations, SS, NSS, Iodent (IDT) and TS, according
to their origins and pedigrees. A further analysis showed that the accessions from
these four subpopulations could be clustered into 13 subgroups, with the following
representative lines: Reid Yellow Dent, Oh43, A634, 207, B37, B73, PHG39, Shen137,
Huangzaosi (HZS), Ye8112, E28, GB and Mo17 (Fig. 2).

Fig. 2. Neighbor-joining trees of the 1,857 maize accessions. Mo17 is a representative line
of Non-Stiff Stalk (NSS). GB is a representative line derived Chinese landrace. E28
is a representative line of the Ludahonggu group. Ye8112 a representative line of
the Modified Reid group. ‘HZS’ is an abbreviation of Huangzaosi, which is a representative
line of the Tangsipingtou group (TSPT). Shen137 is a representative line of the PA
group. PHG39 is a parent derived from Argentine Maize Amargo background. B73 is a
representative line of Stiff Stalk Synthetic (SS). B37, 207, A634, Oh43, and Reid
Yellow Dent are the representative lines of different subpopulations, respectively

Principal component analysis (PCA)

The PCA results showed comprehensive patterns of subpopulation and a good agreement
with both model-based population structure and clustering analyses (Fig. 3). The entire panel of 1857 accessions exhibited moderate differentiation and some
overlap between the temperate and TS germplasm; representative lines from the TS and
temperate region significantly differed, e.g., B73 from the temperate and Ki3 from
the TS region of Thailand, but the accessions from the adjacent regions did not markedly
differ. Which may be resulted by the lager introgression existing between temperate
and tropical/subtropical accessions and lower power of PCA in population structure
analysis by using only two PCs. The accessions from the temperate subpopulation were
further categorized into the B73 subpopulation according to the results of model-based
structure analysis (Fig. 3b) or the Ye8112, B37 and A634 subpopulations based on the results of modified Euclidean
distance (Fig. 3c). Based on the pedigrees, most lines were from the U.S. and China (Fig. 3d and Additional file 2: Table S1). In addition, the TS population was further divided into the HZS, 207,
Oh43, Mo17 and Shen137 subpopulations based on the model-based population structure,
which corresponded to HZS, GB, Shen137, Mo17, and Reid Yellow Dent based on a clustering
analysis (Fig. 3c). These subpopulations contained inbred lines of a TS lineage in their pedigrees
or lines from CIMMYT, Mexico and other tropical regions (Fig. 3a and d). Moreover, many accessions were categorized into new groups, such as the PHG39,
207, A634, Oh43, B37 and E28 subpopulations; most accessions in these groups originated
from regions between temperate and TS zones (Fig. 3) due to the introgression of TS genotypes into regions of temperate germplasms.

Fig. 3. Results of principal components (PCs). Plots ‘a’ and ‘b’ show the comparison between the model-based population structure and the PC analysis
results. Plot ‘c’ shows the comparison between the PC analysis results and the N-J tree constructed
based on modified Euclidean distance. Plot ‘d’ shows the comparison between the original information and the PC analysis results

Summary statistics of genetic diversity

The accessions of the entire panel of 1857 accessions were moderately similar, with
more than 96.22 % of the pair-kinship coefficients varying from 0.30 to 0.53 (Fig. 4a). The average linkage disequilibrium (LD) distance was 30 kilo-bases (kb), varying
from 20 to 50 kb, with an r
2
exceeding 0.1 (Fig. 4b). Combining the results of both the model-based population structure and genomic
variation analyses indicated pronounced patterns of genetic variation among different
subpopulations. These patterns were fixed by artificial or natural selection and resulted
in the division of subpopulations during breeding. The TS subpopulation was more genetically
diverse than the temperate subpopulation, with gene diversities (GDs) of 0.364 and
0.284, respectively, and polymorphism information contents (PICs) of 0.281 and 0.231,
respectively (Table 1). Similar trends were validated with a smaller proportion of SNPs in LD for TS when
comparing with a larger proportion of SNPs in LD for the temperate subpopulation (Fig. 4c).

Fig. 4. Summary statistics of genetic variation existing in the whole set of accessions. ‘a’ is a picture of pair-wise kinship of the 1857 accessions. ‘b’ displays the decay level of linkage disequilibrium (LD) on different chromosomes
and across the whole genome. ‘c’ shows the comparison of LD level between different subpopulations. ‘d’ pictures the genomic differentiation on Chromosome 8

Table 1. Summary statistics of genetic diversity

Genomic differentiation between subpopulations

The proportion of genetic variance due to subpopulations (D
EST
) was measured to interpret the genomic variation between subpopulations (Table 2, Fig. 4(d), Fig. 5 and Additional file 1: Figure S2). The D
EST
indicated different patterns of genomic differentiation between the subpopulations,
ranging from 0 to 0.39 between TS and Temperate (average 0.08), from 0 to 0.45 between
TS I and SS (average 0.09), from 0 to 0.45 between SS and NSS (average 0.07), from
0 to 0.41 between NSS and MICN (average 0.05), from 0 to 0.38 between MICN and TS
I (average 0.06), from 0 to 0.30 between NSS and TS I (average 0.03), and from 0 to
0.57 between SS and MICN (average 0.08). The SS and TS I varieties were more differentiated,
with 332 genomic regions having a large D
EST
(twice the average level) (Fig. 5a). Furthermore, 250 genomic regions were highly differentiated between SS and MICN,
235 were highly differentiated between TS and Temperate, 92 were highly differentiated
between MICN and TS I, 51 were highly differentiated between NSS and MICN, and 8 were
highly differentiated between NSS and TS I, with a D
EST
of more than twice the average level. Most importantly, 85 highly differentiated regions
with a D
EST
exceeding 0.2 were identified between the TS and the temperate subpopulations. Of
these 85 regions, 68 were located within the interval of plant architecture or FT-related
QTL, and two regions were closely linked to vgt1 and zcn8 (Additional file 2: Table S2 and S3). Furthermore, a number of special genomic regions were also found
to be highly differentiated. In particular, subpopulation pairs and common regions
were identified among different population pairs (Fig. 5b). In total, 303 genomic regions with a high D
EST
of more than 0.2 were detected, and these regions were located within 197 FT- or plant
architecture-related QTL. For example, the region containing the tag SNP PZE.108075114
differed more between the TS and temperate subpopulations and was associated with
a D
EST
of 0.32; this region was located within an FT-related QTL cluster and contained the
flanking markers PHTi060 and bnlg1599 (Additional file 2: Table S3).

Table 2. Variation of D
EST
between subpopulations

Fig. 5. Counts of genetic regions with high differentiation. ‘a’ shows the counts of genomic regions for each subpopulation pair. ‘b’ shows the comparison of genomic regions with high differentiation among different
subpopulation pairs

Genome-wide study of FT-related traits

The phenotypes of FT-related traits were significantly positively correlated between
the environments (Additional file 1: Figure S3). Thus, the BLUPs for each accession across the three environments were
calculated, and the phenotype-genotype associations were analyzed. To validate the
putatively adaptive function of highly differentiated target regions, we used the
FT-related traits DTT, DTS, and DTP to perform a GWAS with 43,252 SNPs as a case study.
The results indicated that some highly differentiated genomic regions were associated
with FT-related traits. For example, the SNP of PZE-108070380 was significantly associated
with DTT (P?=?7.05?×?10
?14
), DTP (P?=?2.57?×?10
?9
) and DTS (P?=?2.12?×?10
?8
) (Fig. 6). This SNP was located within the zcn8 gene, which is involved in maize migration from tropical to temperate regions 31]. The SNP PZE-108076585 was significantly associated with DTS (P?=?1.43?×?10
?11
) (Fig. 6). This SNP was located within the vgt1 gene, which is involved in maize adaptation 32]. Furthermore, twelve other SNPs were also strongly associated with FT-related traits
(Fig. 6), and the regions surrounding these SNPs were more differentiated than the rest of
the genome (Fig. 4d, Additional file 1: Figure S2, Additional file 2: Table S2).

Fig. 6. Manhattan plot of GWAS results for flower time related traits. Red cycle refers to
days to pollen-shedding (DTP), blue cycle shows days to silking (DTS), and green cycle
shows days to tasseling (DTT). Red line shows the cutoff value of 5.94 (defined as:
?log
10
(0.05/43,252))