The problems of unfaithful summaries have been widely discussed in the context of abstractive summarization. Though extractive summarization is less prone to the common unfaithfulness issues of abstractive summaries, does that mean extractive is equal to faithful? It turns out that the answer is no. In this work, we define a typology with five types of broad unfaithfulness problems (including and beyond not-entailment) that can appear in extractive summaries: incorrect coreference, incomplete coreference, incorrect discourse, incomplete discourse, and other misleading information. We ask humans to label these problems in 1500 English summaries produced by 15 diverse extractive systems. We find that 33% of the summaries have at least one of the five issues. To automatically detect these problems, we evaluate five existing faithfulness evaluation metrics for summarization and find that they correlate poorly with human judgment. To remedy this, we propose a new metric, ExtEval, that is designed to detect unfaithful extractive summaries and is shown to have the best performance. We hope our work can raise awareness of unfaithfulness problems in extractive summarization and help future work evaluate and resolve these issues. Our data and code are publicly available at https://github.com/ZhangShiyue/extractive_is_not_faithful
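
To make one category of the typology concrete, the short sketch below illustrates what "incomplete coreference" looks like in an extractive summary: an extracted sentence opens with a pronoun whose antecedent sentence was not extracted. This is only an illustrative heuristic under our own assumptions (the function name, word list, and the "previous sentence was dropped" proxy are ours), not the ExtEval metric described in the paper, which a reader should consult for the actual method.

import re

# Words that usually refer back to an earlier mention and therefore need
# their antecedent to be present in the summary as well (illustrative list).
ANAPHORA = {"he", "she", "it", "they", "him", "her", "them",
            "his", "hers", "its", "their", "this", "these", "those"}

def flag_incomplete_coreference(source_sentences, extracted_indices):
    """Return indices of extracted sentences that open with an anaphoric
    word while the immediately preceding source sentence was not extracted.

    A crude proxy only: a fuller check would run a coreference resolver
    over the source and verify that every mention chain touched by the
    summary has an antecedent inside the summary itself.
    """
    extracted = set(extracted_indices)
    flagged = []
    for idx in extracted_indices:
        tokens = re.findall(r"[A-Za-z']+", source_sentences[idx])
        if tokens and tokens[0].lower() in ANAPHORA and (idx - 1) not in extracted:
            flagged.append(idx)
    return flagged

if __name__ == "__main__":
    source = [
        "The company announced a new policy on Monday.",
        "It will take effect next year.",
        "Analysts expect little market impact.",
    ]
    # An extractive summary that keeps sentence 1 but drops sentence 0:
    # "It" now has no antecedent inside the summary.
    print(flag_incomplete_coreference(source, [1, 2]))  # -> [1]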