This research investigates how to determine whether two rankings come from the same distribution. We evaluate three hybrid tests: Wilcoxon's, Dietterich's, and Alpaydin's statistical tests, each combined with cross-validation (CV) using 5 to 10 folds, for 18 variants in total. We apply these tests within the framework of a popular comparative statistical method, the Sum of Ranking Differences, which builds on the Manhattan distance between rankings. The methodology is widely applicable, from machine learning to the social sciences. To compare the methods, we follow an innovative approach borrowed from economics: we designed nine scenarios for testing type I and type II errors. These scenarios represent typical situations (that is, different data structures) that CV tests routinely face. The optimal CV method depends on the preferences regarding the minimization of type I versus type II errors, the size of the input, and the expected patterns in the data. The Wilcoxon method with eight folds proved to be the best for all three investigated input sizes. Although the Dietterich and Alpaydin methods perform best in type I situations, they fail badly in type II cases. We demonstrate our results on real-world data from chess and chemistry. Overall, we cannot recommend either Alpaydin or Dietterich as an alternative to Wilcoxon cross-validation.
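As an illustration of the Manhattan distance between rankings that the Sum of Ranking Differences builds on, here is a minimal Python sketch. It is not the authors' implementation: the function names `average_ranks` and `srd`, the choice of ascending ranks, and the use of averaged ranks for ties are all assumptions made for this example.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest).

    Tied values share the mean of the positions they occupy
    (an assumption of this sketch, not taken from the abstract).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def srd(scores, reference):
    """Sum of absolute rank differences (Manhattan distance) between two rankings."""
    if len(scores) != len(reference):
        raise ValueError("both inputs must rank the same number of objects")
    ranks_a = average_ranks(scores)
    ranks_b = average_ranks(reference)
    return sum(abs(a - b) for a, b in zip(ranks_a, ranks_b))


if __name__ == "__main__":
    identical = srd([0.1, 0.4, 0.7, 0.9], [1, 2, 3, 4])
    reversed_order = srd([1, 2, 3, 4], [4, 3, 2, 1])
    print(identical, reversed_order)
```

A value of zero means the two rankings agree exactly, and larger values mean stronger disagreement. In a cross-validated setting, one would compute such a value on each fold for each method and pass the per-fold values to the chosen statistical test.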