This research investigates how to determine whether two rankings can come from the same distribution. We evaluate three hybrid tests (Wilcoxon's, Dietterich's, and Alpaydin's statistical tests combined with cross-validation), each operating with between 5 and 10 folds, yielding 18 variants altogether. We have used the framework of a popular comparative statistical test, the Sum of Ranking Differences, but our results are representative of all ranking environments. To compare these methods, we followed an innovative approach borrowed from economics. We designed eight scenarios to test type I and type II errors; these represent typical situations (i.e., different data structures) that cross-validation (CV) tests face routinely. The optimal CV method depends on the preferences regarding the minimization of type I and type II errors, the size of the input, and the expected patterns in the data. The Wilcoxon method with eight folds proved to be the best under all three investigated input sizes, although there were scenarios and decision aspects where other methods, namely Wilcoxon~10 and Alpaydin~10, performed better.
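To illustrate the kind of per-fold comparison underlying the hybrid tests above, the following is a minimal sketch of an exact two-sided Wilcoxon signed-rank test applied to paired fold scores. The fold scores here are made-up example numbers, not data from the study, and the implementation assumes no zero differences and no tied absolute differences (the small-sample exact case):

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.
    Assumes no zero differences and no tied absolute differences."""
    d = [a - b for a, b in zip(x, y)]
    # rank the absolute differences (rank 1 = smallest)
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    # W+ = sum of ranks belonging to positive differences
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    n = len(d)
    max_w = n * (n + 1) // 2
    # enumerate the exact null distribution: under H0 each rank
    # carries a positive sign independently with probability 1/2
    counts = {}
    for signs in product([0, 1], repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        counts[w] = counts.get(w, 0) + 1
    # two-sided p-value: probability of a statistic at least as
    # extreme (in either direction) as the observed W+
    extreme = min(w_plus, max_w - w_plus)
    p = sum(c for w, c in counts.items()
            if min(w, max_w - w) <= extreme) / 2 ** n
    return w_plus, p

# hypothetical per-fold scores of two ranking methods over 8 CV folds
method_a = [0.82, 0.79, 0.85, 0.81, 0.78, 0.84, 0.80, 0.83]
method_b = [0.75, 0.77, 0.79, 0.76, 0.74, 0.76, 0.77, 0.74]

w, p = wilcoxon_signed_rank(method_a, method_b)
print(f"W+ = {w}, p = {p:.4f}")
# reject H0 (same distribution) at the 0.05 level if p < 0.05
print("reject H0" if p < 0.05 else "fail to reject H0")
```

Since method_a beats method_b on every fold, all eight ranks carry a positive sign, so the exact two-sided p-value is 2/2^8 = 0.0078 and H0 is rejected at the 0.05 level. The exact enumeration is only feasible for small fold counts such as the 5 to 10 used here; larger samples would use a normal approximation.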