Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows multiple methods, often several thousand, to be evaluated under identical conditions and across time. The highest-ranked performance on a problem is referred to as state-of-the-art (SOTA) performance, and is used, among other things, as a reference point for publication of new methods. Using the highest-ranked performance as an estimate of SOTA is biased, giving overly optimistic results. The mechanisms at play are those of multiplicity, a topic that is well studied in the context of multiple comparisons and multiple testing, but that has, as far as the authors are aware, been nearly absent from the discussion of SOTA estimates. The optimistic state-of-the-art estimate is used as a standard for evaluating new methods, and methods with substantially inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers so that established analysis methods can be applied and a better SOTA estimate can be provided. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how classifier dependency impacts the variance, but also that the impact is limited when the accuracy is high. Finally, we discuss a real-world example: a Kaggle competition from 2020.
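The multiplicity effect described above can be illustrated with a minimal Monte Carlo sketch: many classifiers with the same true accuracy are scored on a finite test set, and the maximum observed accuracy is taken as the SOTA estimate. All parameter values below (test-set size, number of classifiers, true accuracy) are hypothetical choices for illustration, not figures from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

n_test = 1000         # test-set size (hypothetical)
true_acc = 0.80       # shared true accuracy of every classifier
n_classifiers = 1000  # number of independent classifiers evaluated
n_reps = 200          # Monte Carlo repetitions

# Observed accuracy of each classifier: Binomial(n_test, true_acc) / n_test.
obs = rng.binomial(n_test, true_acc, size=(n_reps, n_classifiers)) / n_test

# Naive SOTA estimate: the maximum observed accuracy within each repetition.
sota = obs.max(axis=1)

print(f"true accuracy:        {true_acc:.3f}")
print(f"mean single estimate: {obs.mean():.3f}")   # approximately unbiased
print(f"mean SOTA estimate:   {sota.mean():.3f}")  # optimistically biased
```

With these settings, a single classifier's observed accuracy is an unbiased estimate of 0.80, while the maximum over 1000 independent classifiers sits several standard deviations above it, even though no classifier is actually better than the rest.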