PERGOLA: fast and deterministic linkage mapping of polyploids

Application to real allotetraploid data

We applied PERGOLA to allotetraploid genotypes of a peanut crossing population [9]. The dataset originates from doubled haploid (DH) pedigrees and behaves similarly to diploid data. Applying PERGOLA resulted in a linkage map consisting of ten linkage groups (see left side of Fig. 3), which matches the expected number of chromosomes for peanut known from the literature [29]. Further validation is difficult because the real linkage map is unknown.

https://static-content.springer.com/image/art%3A10.1186%2Fs12859-016-1416-8/MediaObjects/12859_2016_1416_Fig3_HTML.gif
Fig. 3

Global linkage map comparison – PERGOLA and JoinMap®. Comparison of the linkage map created by PERGOLA and JoinMap®. Both are split into ten linkage groups, highlighted by different shades of gray. The linkage groups consist of the same markers. White spaces indicate differences in the marker ordering

However, the diploid nature of the peanut dataset allowed us to compare the results and performance of PERGOLA with linkage mapping tools that do not support polyploids. We selected MapMaker, JoinMap® and R/qtl. MapMaker was used by the authors of the peanut dataset [9], and their results are publicly available. The authors do not report runtimes, which would in any case not be informative because the computational setups are not comparable. JoinMap® is one of the most popular linkage mapping tools [18]. However, it is neither open-source nor open-access, and it only works on Windows systems. R/qtl is publicly available as an R package and allows our comparison to be reproduced. More linkage mapping tools are available, but software comparison is not the main subject of this publication.

Comparing linkage mapping tools is difficult because each tool can output different maps depending on its parameter settings. We used the default parameters of each tool and the Haldane function to calculate the spacing between the markers. The results give a general impression of the performance and should not be overinterpreted, since every tool could be applied in multiple ways that lead to different maps. The motivation of the comparison was to find out whether PERGOLA performs worse than the other tools for a diploid-like data set. For polyploid data sets the other tools cannot be applied and PERGOLA is the method of choice.
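For readers unfamiliar with the Haldane function used for marker spacing, the following minimal Python sketch shows the standard conversion between a recombination fraction and a map distance in centimorgans. The function names are ours and purely illustrative; they are not part of PERGOLA, JoinMap® or R/qtl.

```python
import math


def haldane_cm(r: float) -> float:
    """Haldane map distance in centimorgans for recombination fraction r.

    Assumes 0 <= r < 0.5; at r = 0.5 the markers are unlinked and the
    distance is unbounded.
    """
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    # d [Morgan] = -1/2 * ln(1 - 2r); multiply by 100 for centimorgans.
    return -50.0 * math.log(1.0 - 2.0 * r)


def haldane_inverse(d_cm: float) -> float:
    """Recombination fraction implied by a Haldane distance in centimorgans."""
    return 0.5 * (1.0 - math.exp(-d_cm / 50.0))
```

A recombination fraction of 0.1, for example, corresponds to roughly 11.2 cM, and small fractions map almost linearly to distance.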

The runtime of MapMaker is unknown because the authors of the peanut dataset did not provide computation times. Data preparation is unique to every tool and depends on the format of the given data, so we excluded that step from the time measurement. Linkage grouping took at most a few seconds for all methods and was ignored. The runtime comparison therefore focuses on marker ordering, the most expensive and distinctive step. In R/qtl, JoinMap® and PERGOLA this step corresponds to the commands orderMarkers(), Calculate map and sortLeafs(), respectively. R/qtl was the slowest and took 16 min and 47 s. JoinMap® had a similar runtime of 14 min and 47 s. PERGOLA was the fastest method and took 0.011 s. The better performance results from the use of the OLO algorithm, compared to the sliding window approach in R/qtl and the large overhead in JoinMap®. Runtime matters because linkage mapping has many parameters (e.g. filter criteria) that influence the result, and faster methods allow linkage maps to be optimized systematically. For instance, the number of chromosomes is usually known. If a parameter setting results in a number of linkage groups that differs from the expected chromosome number, the setting should be changed. The runtime of PERGOLA also makes computationally expensive resampling methods (e.g. jackknifing or bootstrapping) feasible, which can improve the interpretability of linkage maps and related QTL detections.
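The check on the number of linkage groups described above can be sketched as a small parameter sweep. The sketch below is a toy illustration under our own assumptions: markers are grouped by single linkage on a pairwise recombination fraction matrix with a cut-off threshold, and all names (count_groups, thresholds_matching, toy_rf) are hypothetical. It does not reproduce the grouping procedure of PERGOLA or any other tool.

```python
from itertools import combinations


def count_groups(n_markers, rf, threshold):
    """Join markers whose pairwise recombination fraction is below
    `threshold` (single linkage) and return the number of groups."""
    parent = list(range(n_markers))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n_markers), 2):
        if rf[i][j] < threshold:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(n_markers)})


def thresholds_matching(n_markers, rf, candidates, expected_groups):
    """Keep only the thresholds that yield the expected number of groups."""
    return [t for t in candidates
            if count_groups(n_markers, rf, t) == expected_groups]


def toy_rf():
    """Invented 6-marker matrix with two groups, {0, 1, 2} and {3, 4, 5}."""
    n = 6
    rf = [[0.5] * n for _ in range(n)]
    for group in ([0, 1, 2], [3, 4, 5]):
        for i in group:
            for j in group:
                rf[i][j] = 0.05 * abs(i - j)
    return rf
```

With two expected groups, thresholds_matching(6, toy_rf(), [0.01, 0.08, 0.3, 0.6], 2) keeps 0.08 and 0.3: the strictest threshold splits every marker into its own group and the loosest one merges everything. A fast ordering step makes it practical to rerun the full pipeline for every surviving setting.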

In PERGOLA and JoinMap® we manually selected ten linkage groups because they were suggested in the grouping step. R/qtl created these linkage groups automatically. We used the Haldane mapping function in all tools. R/qtl applies a sliding window approach in which all possible permutations of the markers within the window are calculated and compared. That approach leads to locally optimized solutions, but can fail to find the best marker order within the linkage group. The default window size is seven; R/qtl performs better with a larger window, but this would lead to even slower computation and was not tested. JoinMap® performs similarly to R/qtl, but uses a more sophisticated approach: it calculates and compares different solutions internally and outputs the best one to the user.
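To illustrate why a sliding window only reaches locally optimized orders, the following Python sketch permutes the markers inside each window and keeps a permutation only if it lowers the total cost of the order. The cost used here (sum of distances between adjacent markers) and the function names are assumptions made for this illustration; R/qtl scores candidate orders with its own criteria, which are not reproduced here.

```python
from itertools import permutations


def order_cost(order, dist):
    """Sum of pairwise distances between adjacent markers in `order`."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def ripple(order, dist, window=3):
    """Slide a window along the order; within each window try every
    permutation and keep it only if it strictly lowers the total cost."""
    order = list(order)
    for start in range(len(order) - window + 1):
        best, best_cost = order, order_cost(order, dist)
        for perm in permutations(order[start:start + window]):
            candidate = order[:start] + list(perm) + order[start + window:]
            cost = order_cost(candidate, dist)
            if cost < best_cost:
                best, best_cost = candidate, cost
        order = best
    return order


# Toy data: marker k sits at position k on the chromosome.
toy_dist = [[abs(i - j) for j in range(6)] for i in range(6)]
```

Starting from the order [5, 1, 2, 3, 4, 0], where the two end markers are swapped, a window of three cannot improve the order because no window contains both misplaced markers. A window spanning all six markers finds an order with the minimal cost, at the price of evaluating all 720 permutations.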

To compare the general linkage maps we transformed all maps into dendrograms. We aligned the chromosome orders and orientation between the maps. Dendrograms maintain the grouping, ordering and spacing of the maps and allow both manual (visual) and computational comparisons. The root line connects the multiple linkage groups at the same height. In our implementation of PERGOLA its height is 0.2 times higher than the highest connection within the linkage groups. It does not reflect their similarity, but supports the readability of the map. The marker order and spacing in the map equal the leaf order and branch height in the dendrogram. We created tanglegrams from the dendrograms for a pairwise comparison of all maps [30]. They allow us to observe differences in the grouping, i.e. whether the same set of markers is in the same linkage groups. The branching height in the dendrogram provides information about the spacing. Traditionally, linkage maps are represented as bars or lines. Each bar represents one linkage group from one map. Lines between the bars indicate the rearrangements between two maps. The linkage groups are distributed so that collisions are minimized. However, for large numbers of linkage groups and high marker density maps, that representation is difficult to interpret. The transformation into a tanglegram is possible without a loss of information, but with a gain in clarity. The spacing information moves into the horizontal dimension and can be explored separately. Markers that are not included in both maps are not shown because they do not contribute to the comparison. However, their number should be provided along with the tanglegram. An example tanglegram is shown in Fig. 3 and others are provided in Additional file 1.

The pairwise tanglegrams show that the maps are generally similar. All maps consist of ten linkage groups, mainly containing the same markers. The maps by R/qtl and MapMaker contain five and six markers more than PERGOLA and JoinMap®, respectively. This information is not illustrated in the tanglegrams. The markers were filtered out and could not be integrated into the ten linkage groups. The total number of markers in the dataset is 459. It is unknown how many were filtered out for the MapMaker map because they were not provided together with the map. However, the marker density is not significantly reduced by the filtering. The quality of the map is more important than a small number of additional markers. Thus, noisy markers should be filtered out rather than creating large gaps in a linkage group.

In our experiment, the Goodman-Kruskal-gamma index for all pairs of maps is almost 1, indicating near-perfect correlation. This contradicts the tanglegrams, in which we observe differences between the linkage maps. Marker grouping has a much larger effect on the Goodman-Kruskal-gamma index than ordering or spacing, and if many markers are grouped similarly, differences in the latter two steps are not represented by it. We conclude that the Goodman-Kruskal-gamma index is too insensitive for the allotetraploid data set. This is also supported by our simulation study. In contrast, the cophenetic correlation coefficient provides reasonable measurements between the maps, as shown in Table 3.
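The two measures can be sketched in a few lines of Python. The sketch assumes that each map has already been reduced to a vector of pairwise values (for example cophenetic distances) listed in the same marker-pair order; how these vectors are derived from the dendrograms is not shown, and the function names are ours. Goodman-Kruskal gamma only counts concordant and discordant pairs, whereas the cophenetic correlation is the Pearson correlation of the two vectors and therefore also reacts to the size of the differences.

```python
import math
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """(C - D) / (C + D), where C and D count concordant and discordant
    pairs of observations; tied pairs are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when all pairs are tied")
    return (concordant - discordant) / (concordant + discordant)


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)
```

Identical rankings give a gamma of 1 and fully reversed rankings give -1. Because gamma discards magnitudes, the many marker pairs that are ranked consistently by the grouping alone can dominate the count, which is in line with the insensitivity noted above.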

Table 3

Pairwise correlations between the four maps. The bottom and top triangles show cophenetic and Goodman-Kruskal correlations, respectively

The results show that PERGOLA calculates linkage maps in a fraction of the time of the other methods. That makes it not just a useful method for polyploid crops, but also an alternative for diploid datasets. The heuristic approach of the recombination calculation leads to minor rearrangements in the grouping. They can be neglected given the overall map similarity and performance advantages of PERGOLA. The tanglegrams suggest a higher similarity between R/qtl, JoinMap® and MapMaker because the grouping is identical. In contrast, the cophenetic correlation indicates that the map by JoinMap® is more similar to the PERGOLA map. That supports our aforementioned hypothesis that there is not one correct linkage map and that we can only estimate the biological situation from different directions. Depending on the input data, filtering parameters, linkage mapping tools and validation criteria, multiple maps are valid. Currently, it is impossible to discard one map or choose one over the other.

We conducted a simulation study to validate the results of PERGOLA and R/qtl for diploids where the real map is known. JoinMap® was excluded because it is limited to a graphical interface and could not be applied automatically to the hundreds of simulated datasets. We used two different numbers of markers (10 and 20 per chromosome) and three population sizes (50, 100, 200), which resulted in six different combinations per tool. Each combination was repeated 100 times. The input linkage maps consisted of two chromosomes and randomly spaced markers. We compared the reference maps with the calculated ones using cophenetic and Goodman-Kruskal correlation. The mean values and standard errors of the cophenetic correlation are shown in Fig. 4. PERGOLA and R/qtl perform similarly for maps with 10 markers per chromosome, independently of the population size. For setups with 20 markers per chromosome the sliding window approach of R/qtl reaches its limits and the linkage maps differ significantly. Taken together, in this simulation study PERGOLA not only performs better computationally, but also produces better linkage maps for diploids.
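The layout of the simulation study (two marker counts, three population sizes, 100 repetitions each) and the summary statistics plotted in Fig. 4 can be sketched as follows. The run_once callback is hypothetical: it stands for simulating one dataset, building a map and returning its correlation with the reference map. The placeholder below returns random numbers only so that the sketch runs; it does not reproduce any result of this study.

```python
import math
from itertools import product
from statistics import mean, stdev

MARKERS_PER_CHROMOSOME = (10, 20)
POPULATION_SIZES = (50, 100, 200)
REPETITIONS = 100


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / math.sqrt(len(values))


def summarize(run_once, rng):
    """Run every setup REPETITIONS times and return a dictionary
    {(markers, population): (mean, standard error)}."""
    summary = {}
    for n_markers, pop_size in product(MARKERS_PER_CHROMOSOME, POPULATION_SIZES):
        scores = [run_once(n_markers, pop_size, rng) for _ in range(REPETITIONS)]
        summary[(n_markers, pop_size)] = (mean(scores), standard_error(scores))
    return summary


def placeholder_run(n_markers, pop_size, rng):
    """Hypothetical stand-in for simulate -> map -> correlate.
    Returns a random value, NOT a real correlation."""
    return rng.uniform(0.8, 1.0)
```

Calling summarize(placeholder_run, random.Random(0)) after importing random yields six entries, one per combination of marker number and population size, each holding the mean and standard error that a bar and its error bar would display.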

https://static-content.springer.com/image/art%3A10.1186%2Fs12859-016-1416-8/MediaObjects/12859_2016_1416_Fig4_HTML.gif
Fig. 4

Diploid simulation study result. We simulated six setups of diploid populations with two chromosomes and repeated each 100 times. We used population sizes of 50, 100 or 200 and 10 or 20 markers per chromosome. We applied PERGOLA and R/qtl to calculate linkage maps which were compared with the reference map. The bars show the mean correlation value of 100 repetitions and the error bars indicate the standard error