What are the best systems? New perspectives on NLP Benchmarking

In machine learning, a benchmark is a collection of datasets, associated with one or more metrics, together with a way to aggregate the performance of different systems. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly true in NLP, where large pre-trained models (e.g. GPT, BERT) are expected to generalize well across a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. This procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the procedure obtains the final system ordering by aggregating the rankings induced by each task, and it is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach on both synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions about state-of-the-art systems than the mean-aggregation procedure, while being both more reliable and more robust.
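The contrast between averaging scores and aggregating per-task rankings can be illustrated with a small sketch. The abstract does not state which social-choice rule the paper uses, so the Borda count below is an assumption chosen for simplicity; the system names, task scores, and function names are all hypothetical and made up for illustration.

```python
def mean_ranking(scores):
    """Order systems by their plain average score across tasks (best first)."""
    means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def borda_ranking(scores):
    """Order systems by Borda count over the per-task rankings (best first).

    Assumption: Borda count stands in for the paper's aggregation rule,
    which the abstract does not specify.
    """
    systems = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {name: 0 for name in systems}
    for t in range(n_tasks):
        for name in systems:
            # One point for every other system strictly beaten on task t.
            points[name] += sum(
                scores[name][t] > scores[other][t]
                for other in systems
                if other != name
            )
    return sorted(points, key=points.get, reverse=True)


# Hypothetical scores: two tasks on a 0-1 scale, one task on a 0-100 scale.
scores = {
    "A": [0.90, 0.80, 40.0],
    "B": [0.92, 0.85, 30.0],
    "C": [0.91, 0.82, 35.0],
}

print(mean_ranking(scores))   # ['A', 'C', 'B']
print(borda_ranking(scores))  # ['B', 'C', 'A']
```

In this toy example the third task dominates the average because of its larger scale, so system A comes first under mean aggregation even though B wins two of the three tasks. Aggregating rankings depends only on the order of the systems within each task, so it is unaffected by the scale of each metric.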
