How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and propose changes to rectify this disconnect. First, we calculate the system score for an automatic metric on the full test set instead of only the subset of summaries judged by humans, as is currently standard practice. We demonstrate how this small change leads to more precise estimates of system-level correlations. Second, we propose to calculate correlations only on pairs of systems separated by the small differences in automatic scores that are commonly observed in practice. This allows us to demonstrate that our best estimate of the correlation of ROUGE to human judgments is near 0 in realistic scenarios. The results of these analyses point to the need to collect more high-quality human judgments and to improve automatic metrics when differences in system scores are small.
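The following is a minimal sketch (not the authors' released code) of how the two proposed changes could be realized: (1) each system's metric score is averaged over the full test set rather than only the human-judged subset, and (2) a Kendall-style pairwise agreement is computed only over system pairs whose automatic-score difference falls below a threshold. The names `metric_scores_full`, `human_scores_judged`, and `delta_threshold` are illustrative assumptions, not identifiers from the paper.

```python
from itertools import combinations
from statistics import mean


def system_level_correlation(metric_scores_full, human_scores_judged, delta_threshold=None):
    """Kendall-style agreement between metric and human rankings of systems.

    metric_scores_full:  {system: [metric score for every test-set summary]}
    human_scores_judged: {system: [human score for the judged subset only]}
    delta_threshold:     if given, keep only system pairs whose metric-score
                         difference is at most this value.
    """
    # Change 1: system-level metric scores are computed on the full test set.
    metric_sys = {s: mean(v) for s, v in metric_scores_full.items()}
    human_sys = {s: mean(v) for s, v in human_scores_judged.items()}

    concordant, discordant, kept = 0, 0, 0
    for a, b in combinations(metric_sys, 2):
        delta = metric_sys[a] - metric_sys[b]
        # Change 2: optionally restrict to closely scored system pairs.
        if delta_threshold is not None and abs(delta) > delta_threshold:
            continue
        kept += 1
        if delta * (human_sys[a] - human_sys[b]) > 0:
            concordant += 1
        else:
            discordant += 1

    if kept == 0:
        return float("nan")
    # Tau-like statistic over the retained pairs only.
    return (concordant - discordant) / kept
```

Under this sketch, passing a small `delta_threshold` restricts the correlation to the system pairs that are hardest for a metric to rank correctly, which is the regime in which the paper reports ROUGE's correlation to be near 0.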