State-of-the-art summarization systems can generate highly fluent summaries. These summaries, however, may contain factual inconsistencies and/or information not present in the source. Hence, an important component of assessing the quality of summaries is to determine whether there is information consistency between the source and the summary. Existing approaches are typically based on lexical matching or representation-based methods. In this work, we introduce an alternative scheme based on standard information-theoretic measures in which the information present in the source and summary is directly compared. We propose a Multiple-choice Question Answering and Generation framework, MQAG, which approximates the information consistency by computing the expected KL-divergence between summary and source answer distributions over automatically generated multiple-choice questions. This approach exploits multiple-choice answer probabilities, as predicted answer distributions can be easily compared. We conduct experiments on four summary evaluation datasets: QAG-CNNDM/XSum, XSum-Faithfulness, Podcast Assessment, and SummEval. Experiments show that MQAG (using models trained on RACE) outperforms existing evaluation methods on the majority of tasks.
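To make the scoring step concrete, the following is a minimal sketch of how the expected KL-divergence described above could be computed, assuming a question-generation model has already produced N multiple-choice questions (each with K options) from the summary, and a multiple-choice QA model has produced answer distributions once with the source and once with the summary as context. The function name `mqag_score` and the direction of the KL-divergence are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def mqag_score(p_summary: np.ndarray, p_source: np.ndarray, eps: float = 1e-12) -> float:
    """Approximate the MQAG score as the mean KL-divergence between the
    summary-conditioned and source-conditioned answer distributions over
    N automatically generated multiple-choice questions with K options.

    p_summary, p_source: arrays of shape (N, K), one answer distribution
    per question, predicted with the summary / source as context.
    """
    # Clip and renormalise so each row is a valid, strictly positive distribution.
    p = np.clip(p_summary, eps, 1.0)
    q = np.clip(p_source, eps, 1.0)
    p /= p.sum(axis=1, keepdims=True)
    q /= q.sum(axis=1, keepdims=True)
    kl_per_question = np.sum(p * np.log(p / q), axis=1)  # KL(p || q) for each question
    return float(np.mean(kl_per_question))               # expectation over questions

# Toy usage: 2 questions, 4 answer options each.
p_sum = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
p_src = np.array([[0.60, 0.20, 0.10, 0.10],
                  [0.10, 0.70, 0.10, 0.10]])
print(mqag_score(p_sum, p_src))  # larger value => less information consistency
```

A larger averaged divergence indicates that the summary and the source lead the QA model to different answers, i.e. lower information consistency.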