Exact p-values for pairwise comparison of Friedman rank sums, with application to comparing classifiers

The Friedman [1] rank sum test is a widely-used nonparametric method for the analysis of several related samples in computational biology and other fields. It is used, for example, to compare the performance results of a set of (expression-based) classifiers over multiple datasets, covering case problems, benchmark functions, or performance indicators [24]. Some recent examples of the numerous applications of the Friedman test in bioinformatics include [517]. The test is supported by many statistical software packages and it is routinely discussed in textbooks on nonparametric statistics [1823].

The Friedman test procedure is an analysis of variance by ranks, i.e., observed rank scores or rank scores obtained by ordering ordinal or numerical outcomes, that is used when one is not willing to make strong distributional assumptions. A common approach is to present the test as a method for identifying treatment effects of k different treatments in a so-called randomized complete block design. This design uses n sets, called blocks, of k homogeneous units matched on some relevant characteristic, for example patients matched on SNP genotype. The k treatments are assigned randomly to the k units within each block, with each treatment condition being administered once within a block. The Friedman test is also conducted if the samples concern a repeated measures design. In such design each experimental unit constitutes a block that serves in all treatment conditions. Examples are provided by experiments in which k different treatments (classifiers) are compared on a single experimental unit (dataset), or if k units (e.g., genes, products, candidates) are ranked in order by each of n ‘judges’ (algorithms, panelists). In all these settings the objective is to determine if the k populations from which the observations were made are identically distributed.

Applied to classifier comparison, the null hypothesis for the Friedman test is that the performance results of the k classifiers over n datasets are samples that have been drawn from the same population or, equivalently, from different populations with the same distribution [18]. To examine this hypothesis, the test determines whether the rank sums of the k classifiers included in the comparison are significantly different. After applying the omnibus Friedman test and observing that the rank sums are different, the next step is to compare all classifiers against each other or against a baseline classifier (e.g., newly proposed method or best performing algorithm). In doing so, a series of comparisons of rank sums (i.e., rank sum difference tests) is performed, adjusting the significance level using a Bonferroni correction or more powerful approaches to control the familywise Type-I error rate [3, 4].

Preferably one should use the exact null distribution of the rank sum difference statistic in these subsequent analyses. Only if the decision on the null hypothesis is based on the exact distribution is the probability of committing a Type-I error known exactly. However, the exact distribution and the associated true tail probabilities are not yet adequately known. To be sure, tables of exact critical values at standard significance levels (e.g., [18, 21, 22]) and of exact p-values [24] are available for small values of k and n, for which complete enumeration is possible. Also, the lower order moments of the exact sampling distribution have been documented in detail [25], and Stuart [26] proved the conjecture of Whitfield [24] that, on the null hypothesis, the difference between rank sum values is asymptotically normally distributed as n tends to infinity. Further, in a recent study Koziol [27] used symbolic computation for finding the distribution of absolute values of differences in rank sums. Apart from these important contributions there is, to the best of our knowledge, no publication available in the probability theory, rank statistics or other literature that addresses the exact distribution of the rank sum difference statistic.

As the null distribution in the general case is unknown and exact computation seemingly intractable, it is generally recommended to apply a large-sample approximation method to test the significance of the pairwise difference in rank sums. It is well known, however, that calculating probabilities using an asymptotic variant of an exact test may lead to inaccurate p-values when the sample size n is small, as in most applications of the Friedman test, and thereby to a false acceptance or false rejection of the null hypothesis. Furthermore, there are several large-sample tests available with different limiting distributions, and these tests may vary substantially in their results. Consequently, there is little agreement in the nonparametric literature over which approximate method is most appropriate to employ for the comparison of Friedman rank sums [22]. This statement refers both to approximate tests of significance for the comparison of all (
2k
)?=?k(k???1)/2 pairs of treatments, and to tests for the comparison of k???1 treatments with a single control. Obviously, the utility of the asymptotic tests depends on their accuracy to approximate the exact sampling distribution of the discrete rank sum difference statistic.

The purpose of this note is twofold. The foremost aim is to provide an expression for calculating the exact probability mass function of the pairwise differences in Friedman rank sums. This expression may be employed to quickly calculate the exact p-value and associated statistics such as the critical difference value. The calculation does not require a complicated algorithm and as it is easily incorporated into a computer program, there is no longer need to resort to asymptotic p-values. We illustrate the exact method in the context of two recently published analyses of the performance of classifiers and data projection methods. The second aim is to examine under what circumstances the exact distribution and the associated exact statistics offer an improvement over traditional approximate methods for Friedman rank sum comparison.

It is important to note at the outset that this article is not concerned with the Friedman test itself. The Friedman test is an over-all test that evaluates the joint distribution of rank sums to examine equality in treatment distributions. Computation of the exact joint distribution under the null is discussed by van de Wiel [28], and an efficient algorithm for computing the exact permutation distribution of the Friedman test statistic is available in StatXact [29]. The present paper offers an over-all exact test for pairwise comparison of Friedman rank sums. The reason is essentially that researchers are usually not only interested in knowing whether any difference exists among treatments, but also in discovering which treatments are different from each other, and the Friedman test is not designed for this purpose. Although the type of test dealt with here is not the same as the Friedman test, we will briefly discuss the latter as the procedures have important elements in common, such as the global null hypothesis. Also, we assume in our discussion that the available data are such that it is appropriate to perform simultaneous rank sum tests. Hence, we ignore empirical issues such as insufficient power (too few datasets), selection bias (nonrandom selection of datasets), and like complications that, as noted by Boulesteix et al. ([30]; see also [31]), tend to invalidate statistical inference in comparative benchmarking studies of machine learning methods solving real-world problems. In ANOVA, the term ‘treatment’ is used as a common term for the grouping variable for which a response is measured. To accommodate the wide variety of applications of the Friedman test, the more general term ‘group’ is used instead of ‘treatment’ in the remainder of this paper. The term subject is used hereafter to include both objects and individuals.