In this paper, we propose FFCI, a framework for automatic summarization evaluation that comprises four elements: Faithfulness, Focus, Coverage, and Inter-sentential coherence. We design FFCI by comprehensively studying traditional evaluation metrics and model-based evaluations, including question answering (QA) approaches, semantic textual similarity (STS), next-sentence prediction (NSP), and scores derived from 19 pre-trained language models. Our study reveals three key findings: (1) for faithfulness evaluation, computing BERTScore between summary sentences and article sentences correlates better with human judgments than recently proposed QA-based evaluation methods; (2) GPT2Score achieves the best Pearson correlation for focus and coverage; and (3) a simple NSP model is effective at evaluating inter-sentential coherence.
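To make finding (1) concrete, the sentence-level comparison between a summary and its source article can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `token_f1` and `faithfulness_score`, the `top_n` parameter, and the aggregation (average of the top-n article matches per summary sentence, then a mean over summary sentences) are all hypothetical, and the bag-of-words F1 is only a lightweight stand-in for a learned metric such as BERTScore.

```python
from collections import Counter
from typing import Callable, List


def token_f1(candidate: str, reference: str) -> float:
    """Toy stand-in for a learned similarity metric: bag-of-words F1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def faithfulness_score(
    summary_sents: List[str],
    article_sents: List[str],
    metric: Callable[[str, str], float] = token_f1,
    top_n: int = 1,
) -> float:
    """Score each summary sentence against every article sentence.

    Assumed aggregation (hypothetical): average the top_n best article
    matches per summary sentence, then average over summary sentences.
    """
    if not summary_sents or not article_sents:
        return 0.0
    per_sentence = []
    for sent in summary_sents:
        scores = sorted((metric(sent, src) for src in article_sents), reverse=True)
        best = scores[:top_n]
        per_sentence.append(sum(best) / len(best))
    return sum(per_sentence) / len(per_sentence)


if __name__ == "__main__":
    article = ["the cat sat on the mat", "it rained all day"]
    summary = ["the cat sat on the mat"]
    print(faithfulness_score(summary, article))
```

Any function that maps a sentence pair to a similarity score can be passed as `metric`, so a model-based scorer could replace the toy overlap measure without changing the aggregation logic.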