Evaluation in NLP is usually done by comparing the scores of competing systems, each independently averaged over a common set of test instances. In this work, we question the use of averages for aggregating evaluation scores into a final number used to decide which system is best, since the average, as well as alternatives such as the median, ignores the pairing arising from the fact that systems are evaluated on the same test instances. We illustrate the importance of taking the instance-level pairing of evaluation scores into account and demonstrate, both theoretically and empirically, the advantages of aggregation methods based on pairwise comparisons, such as the Bradley-Terry (BT) model, a mechanism based on the estimated probability that a given system scores better than another on the test set. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics, we show that the choice of aggregation mechanism matters and yields different conclusions as to which systems are state of the art in about 30% of the setups. To facilitate the adoption of pairwise evaluation, we release a practical tool for performing the full analysis of evaluation scores with the mean, median, BT, and two variants of BT (Elo and TrueSkill), alongside functionality for appropriate statistical testing.
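To make the contrast between mean-based and pairwise aggregation concrete, the following is a minimal sketch (not the released tool) of how per-instance scores can be aggregated with pairwise win counts and a Bradley-Terry model fitted by the standard MM updates. The `scores` matrix, the tie-splitting convention, and the toy numbers are illustrative assumptions, not data from the paper.

```python
# Minimal sketch: aggregating per-instance evaluation scores with pairwise
# comparisons and a Bradley-Terry (BT) model (MM / Zermelo updates).
# `scores` is a hypothetical (n_systems x n_instances) array where every
# system is scored on the same test instances and higher is better.
import numpy as np


def pairwise_wins(scores: np.ndarray) -> np.ndarray:
    """Count, for each ordered pair (i, j), the instances where system i
    outscores system j; ties are split as half a win for each side
    (one common convention, assumed here)."""
    n = scores.shape[0]
    wins = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            wins[i, j] = np.sum(scores[i] > scores[j]) \
                + 0.5 * np.sum(scores[i] == scores[j])
    return wins


def bradley_terry(wins: np.ndarray, n_iter: int = 200) -> np.ndarray:
    """Fit BT strengths with the standard MM updates; strengths[i] is
    proportional to the estimated probability that system i beats a
    randomly chosen opponent on a test instance."""
    n = wins.shape[0]
    p = np.ones(n)
    comparisons = wins + wins.T          # total comparisons per pair
    off_diag = ~np.eye(n, dtype=bool)    # skip the (i, i) terms
    for _ in range(n_iter):
        for i in range(n):
            denom = np.sum(comparisons[i][off_diag[i]]
                           / (p[i] + p[off_diag[i]]))
            p[i] = wins[i].sum() / denom
        p /= p.sum()                     # BT is scale-invariant; normalize
    return p


# Toy example: 3 systems evaluated on the same 5 test instances.
scores = np.array([
    [0.6, 0.7, 0.8, 0.6, 0.7],
    [0.5, 0.6, 0.7, 1.0, 0.65],
    [0.4, 0.6, 0.6, 0.5, 0.5],
])
strengths = bradley_terry(pairwise_wins(scores))
print("Mean-based ranking:", np.argsort(-scores.mean(axis=1)))  # [1, 0, 2]
print("BT-based ranking:  ", np.argsort(-strengths))            # [0, 1, 2]
```

In this constructed toy case, a single high-scoring outlier instance makes system 1 the winner under the mean, while the pairwise/BT ranking favors system 0, which wins most head-to-head comparisons; this is the kind of disagreement between aggregation mechanisms that the abstract reports in about 30% of the re-evaluated setups.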