The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is not clear how precise these correlation estimates are, nor whether differences between two metrics' correlations reflect a true difference or are due to random chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are. Further, although many metrics fail to show statistically significant improvements over ROUGE, two recent metrics, QAEval and BERTScore, do in some evaluation settings.
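To make the two resampling procedures concrete, the following is a minimal sketch of (a) a percentile-bootstrap confidence interval for one metric's correlation with human scores, and (b) a paired permutation test for the difference between two metrics' correlations. This is an illustrative implementation of generic bootstrapping and permutation testing, not the paper's exact method (the paper considers several resampling variants); the function names and parameters are our own.

```python
import numpy as np


def bootstrap_corr_ci(metric_scores, human_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the Pearson correlation between a
    metric's scores and human annotations, resampling summaries."""
    rng = np.random.default_rng(seed)
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    n = len(x)
    corrs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample summaries with replacement
        xs, ys = x[idx], y[idx]
        if xs.std() == 0 or ys.std() == 0:
            continue  # correlation undefined for a constant resample
        corrs.append(np.corrcoef(xs, ys)[0, 1])
    lo, hi = np.quantile(corrs, [alpha / 2, 1 - alpha / 2])
    return lo, hi


def permutation_test_corr_diff(metric_a, metric_b, human_scores, n_perm=1000, seed=0):
    """Paired permutation test of H0: the two metrics correlate equally
    with human scores. Per summary, randomly swap which metric produced
    which score, and compare the resulting correlation differences to
    the observed one. Returns a two-sided p-value."""
    rng = np.random.default_rng(seed)
    a = np.asarray(metric_a, dtype=float)
    b = np.asarray(metric_b, dtype=float)
    h = np.asarray(human_scores, dtype=float)

    def corr_diff(u, v):
        return np.corrcoef(u, h)[0, 1] - np.corrcoef(v, h)[0, 1]

    observed = corr_diff(a, b)
    extreme = 0
    for _ in range(n_perm):
        swap = rng.random(len(a)) < 0.5  # coin-flip swap per summary
        a_perm = np.where(swap, b, a)
        b_perm = np.where(swap, a, b)
        if abs(corr_diff(a_perm, b_perm)) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0
```

A bootstrap interval that excludes zero, or a small permutation p-value, would support claims like the ones in the abstract; with the small annotation sets typical of summarization evaluation, both tend to be inconclusive, which is the paper's central observation.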