Automatically evaluating the coherence of summaries is important both for cost-efficient summarizer evaluation and as a tool for improving coherence by selecting high-scoring candidate summaries. While many different approaches have been suggested to model summary coherence, they are often evaluated using disparate datasets and metrics. This makes it difficult to understand their relative performance and to identify ways forward towards better summary coherence modelling. In this work, we conduct a large-scale investigation of various methods for summary coherence modelling on a level playing field. Additionally, we introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders. While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models fine-tuned on self-supervised tasks show promising results, provided that fine-tuning accounts for the need to generalize across summaries of different lengths.
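To make the idea of intra-system correlation concrete, the sketch below shows one plausible formulation, an assumption on our part rather than the paper's exact definition: correlate a coherence measure's scores with human coherence ratings separately within each summarization system, then average the per-system correlations, so that system-level confounders (e.g. one system producing uniformly longer summaries) cannot inflate the result. The function name `intra_system_correlation` and the record format are hypothetical.

```python
from collections import defaultdict
from scipy.stats import kendalltau

def intra_system_correlation(records):
    """Average per-system Kendall's tau between a coherence measure's
    scores and human coherence ratings.

    records: iterable of (system_id, measure_score, human_score) triples.
    Correlating within each system controls for system-level confounders,
    unlike a single correlation pooled across all systems.
    """
    by_system = defaultdict(list)
    for system_id, measure_score, human_score in records:
        by_system[system_id].append((measure_score, human_score))

    taus = []
    for pairs in by_system.values():
        if len(pairs) < 2:
            continue  # correlation is undefined for a single summary
        measure_scores, human_scores = zip(*pairs)
        tau, _ = kendalltau(measure_scores, human_scores)
        if tau == tau:  # skip NaN (e.g. constant scores within a system)
            taus.append(tau)
    return sum(taus) / len(taus) if taus else float("nan")

# Hypothetical usage: per-summary scores from three systems.
records = [
    ("sysA", 0.80, 4.0), ("sysA", 0.60, 3.0), ("sysA", 0.70, 3.5),
    ("sysB", 0.40, 2.0), ("sysB", 0.50, 2.5), ("sysB", 0.30, 1.5),
    ("sysC", 0.90, 4.5), ("sysC", 0.85, 4.0), ("sysC", 0.95, 5.0),
]
print(intra_system_correlation(records))
```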