Defining window-boundaries for genomic analyses using smoothing spline techniques


A recurrent question that arises during the analysis of high-density genotyping or
sequencing information is how to best analyze noisy data. This question is particularly
relevant when analyzing sequence data from pooled samples of populations for which,
depending on the number of individuals pooled and the level of coverage per site,
estimates of individual base pair (bp) allele frequencies can be very imprecise 1]. To account for this variability, methods based on estimating parameters over “windows”
have been successfully used to reduce sampling error while retaining true signal in
studies aimed at identifying evidence of selection in populations 2]-5]. In general, window-based techniques treat observations from individual genetic markers,
often single nucleotide polymorphisms (SNPs), as samples that are representative of
a phenomenon that affects isolated regions of the genome instead of independent SNPs.
In studies aiming at identifying selection signatures, genetic hitchhiking 6] makes this approximation quite reasonable. It is also useful for other applications
since, with the availability of increasingly denser marker arrays, linkage disequilibrium
(LD) between SNPs within any particular region is likely to be substantial. Therefore,
a summary statistic can be computed across a region or a window, instead of for individual
SNPs. This summary statistic can be as simple as taking the mean of single-SNP estimates
3] or it can take a more complex form such as an aggregated measurement of divergence
according to the Fisher’s angular transformation 4],7]. By using a sample of observations that are each considered as an estimate of the
same phenomenon, as opposed to treating observations individually, sampling error
may be markedly reduced, while retaining true signal. An inherent assumption of these
methods is that the individual marker estimates within a window are independently
and identically distributed.

Two types of approaches for delineating window boundaries are commonly used. These
are referred to as “distinct windows”, for which markers in different windows do not
overlap, and “sliding windows”, for which they do. When using distinct windows, the
genome is divided into separate segments of equal length, with the length defined
according to either the number of SNPs 4],8], or the number of base pairs (bp) 9]. A summary statistic that captures genomic patterns across each window, such as the
mean FST, is then calculated over all SNPs within a defined window. Distinct windows often
succeed at reducing the sampling error of estimates while reducing the number of statistical
tests performed, but the placement of windows is random or sequential, so power may
be lost if window placement inadvertently splits one meaningful region into adjacent
windows. In a sliding window approach, a window length (again in number of bp or SNPs)
is defined and windows are incrementally advanced along the genome, a single or a
few SNPs at a time, to ensure that every possible window is considered 5],10],11]. However, when using such an approach, the number of tests is not dramatically reduced
since a new window is defined for every SNP or every few SNPs. In addition, highly
correlated statistics are generated, since each window overlaps with its neighboring
windows.

Beyond the limitations mentioned above, in the case of either distinct or sliding
windows, determining the proper window size is typically subjective and researchers
often only loosely justify their choice of size 10],12], or acknowledge that their choice is arbitrary 2],13]. This is unsatisfying for two reasons. First, there should be an optimum window size
that balances noise reduction with signal identification to maximize power, and identifying
this optimum would be ideal. Second, a subjective definition of window size typically
leads to the use of a uniform window size across the genome, which is not appropriate
since various genetic parameters, including recombination rate and LD, vary along
each chromosome.

To address these problems, we have developed an empirically driven framework to define
window boundaries, while simultaneously determining their ideal size. Our method retains
the benefits of distinct windows in that it reduces the total number of tests and
generates window values that are not inherently correlated, while also borrowing from
sliding-windows by reducing the risk of erroneously dividing signal between adjacent
windows. In addition, the ideal window size is automatically chosen and allowed to
vary along the genome. The method is based on first fitting a cubic smoothing spline
14] to single-SNP estimates of a parameter such as FST15]. Previously, various forms of smoothing splines have been used to analyze genomic
information 16],17], but not to define windows. The smoothness of the spline is chosen by leave-one-out
cross-validation, to ensure that it optimally predicts single-SNP values. The second
derivative of the spline is then computed and inflection points are identified. The
inflection points of the fitted spline isolate the positions where the spline switches
from tending towards a local maximum to a minimum, or vice versa, and therefore DNA
between these positions may correspond to a correlated region of the genome. Therefore,
inflection points are treated as window boundaries and a distinct-window analysis
proceeds. Using inflection points to define window boundaries virtually ensures that
any peak in the fitted spline is placed in a single window instead of split across
windows. Determining the fitted spline’s smoothness by cross-validation leads to ideal
window-sizes. Moreover, although a uniform smoothness is chosen for the fitted spline,
this does not explicitly restrict the location of its inflection points, thereby allowing
non-uniformity of window sizes.

In this paper, we describe a smoothing spline-based approach to define windows using
genomic data. In addition, we apply the method to both simulated and real data to
identify signatures of selection, and demonstrate its advantages over previously used
techniques. Although we present this method in the context of FST-based studies, it can be applied to several other approaches that require the pooling
of genotypic data over windows. This method has been implemented in a freely available
R package, GenWin.