Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows multiple methods, often several thousand, to be evaluated under identical conditions and across time. The highest-ranked performance on a problem is referred to as state-of-the-art (SOTA) performance and is used, among other things, as a reference point for the publication of new methods. Using the highest-ranked performance as an estimate of SOTA is biased, giving overly optimistic results. The mechanism at play is multiplicity, a topic well studied in the context of multiple comparisons and multiple testing, but one that, as far as the authors are aware, has been nearly absent from the discussion of SOTA estimates. The overly optimistic SOTA estimate is used as a standard for evaluating new methods, and methods with substantially inferior results are easily overlooked. In this article, we provide a probability distribution for the maximum performance of multiple classifiers so that known analysis methods can be applied and a better SOTA estimate can be obtained. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how dependence between classifiers affects the variance, but also that the impact is limited when the accuracy is high. Finally, we discuss a real-world example: a Kaggle competition from 2020.
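The multiplicity mechanism can be made concrete with a short order-statistics sketch; the symbols $m$, $n$, $p$, $X_i$, and $\hat p_{\max}$ below are our illustrative notation, not necessarily the paper's. If $m$ independent classifiers each have true accuracy $p$ and are evaluated on the same $n$ test points, the correct-prediction count of classifier $i$ is $X_i \sim \mathrm{Binomial}(n, p)$, and the reported SOTA is the best observed accuracy $\hat p_{\max} = \max_i X_i / n$. Under independence,
$$P\left(\hat p_{\max} \le x\right) \;=\; \prod_{i=1}^{m} P\left(X_i \le nx\right) \;=\; F_{\mathrm{Bin}(n,p)}\!\left(\lfloor nx \rfloor\right)^{m},$$
where $F_{\mathrm{Bin}(n,p)}$ is the binomial cumulative distribution function. For $m > 1$ this distribution is shifted upward relative to that of a single classifier, so $E[\hat p_{\max}] > p$: the highest-ranked result systematically overstates the true SOTA.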
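A minimal Monte Carlo sketch of this effect follows, assuming independent classifiers with a common true accuracy; the parameter values (n, p, reps, and the grid of m) are illustrative choices of ours, not taken from the paper or from the Kaggle example.

```python
# Monte Carlo illustration of the multiplicity bias in SOTA estimates.
# Assumption (ours): m independent classifiers, each with true accuracy p,
# evaluated on a shared test set of n points. The "SOTA estimate" is the
# best observed accuracy among the m classifiers.
import numpy as np

rng = np.random.default_rng(0)

n = 10_000   # test-set size
p = 0.90     # true accuracy of every classifier
reps = 2_000 # Monte Carlo replicates

for m in (1, 10, 100, 1000):
    # Each classifier's correct-count is Binomial(n, p); independence
    # across classifiers lets us sample the counts directly.
    correct = rng.binomial(n, p, size=(reps, m))
    # Highest observed accuracy in each replicate, i.e. the SOTA estimate.
    sota = correct.max(axis=1) / n
    print(f"m = {m:>4}: mean SOTA estimate = {sota.mean():.4f} "
          f"(true accuracy = {p})")
```

With these assumed numbers, the mean of the best observed accuracy climbs visibly above $p$ as $m$ grows, which is precisely the optimistic bias described above; positively dependent classifiers would behave like a smaller effective $m$, attenuating, but not removing, the effect.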