Despite recent progress in abstractive summarization, models often generate summaries with factual errors. Numerous approaches to detect these errors have been proposed, the most popular of which are question answering (QA)-based factuality metrics. These have been shown to work well at predicting summary-level factuality and have the potential to localize errors within summaries, but this latter capability has not been systematically evaluated in past research. In this paper, we conduct the first such analysis and find that, contrary to our expectations, QA-based frameworks fail to correctly identify error spans in generated summaries and are outperformed by trivial exact match baselines. Our analysis reveals a major reason for such poor localization: questions generated by the QG module often inherit errors from non-factual summaries, which are then propagated further into downstream modules. Moreover, even human-in-the-loop question generation cannot easily offset these problems. Our experiments conclusively show that there exist fundamental issues with localization using the QA framework which cannot be fixed solely by stronger QA and QG models.
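To make the QG-then-QA pipeline concrete, the following is a minimal Python sketch of how such a factuality framework checks summary spans and where errors can propagate. The helper names (generate_questions, answer_question, answers_match) are hypothetical placeholders, not the API of any specific metric; this is an illustration of the general architecture, not the exact systems evaluated in the paper.

```python
# Minimal sketch of a QA-based factuality pipeline: QG -> QA -> answer comparison.
# All component names are hypothetical stand-ins for real QG/QA models.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SpanJudgment:
    question: str        # question generated from a summary span
    summary_answer: str  # the summary span being checked (the "gold" answer)
    source_answer: str   # answer the QA model extracts from the source document
    factual: bool        # whether the two answers are judged consistent


def qa_factuality(
    summary: str,
    source: str,
    generate_questions: Callable[[str], List[Tuple[str, str]]],  # summary -> [(question, summary_answer)]
    answer_question: Callable[[str, str], str],                  # (question, source) -> source_answer
    answers_match: Callable[[str, str], bool],                   # answer comparison, e.g. exact match
) -> List[SpanJudgment]:
    """Judge each summary span by asking a question about it and answering from the source.

    If the summary is non-factual, the generated question itself can inherit the error,
    so the QA step may fail for reasons unrelated to the span being checked -- the
    error-propagation issue discussed above.
    """
    judgments = []
    for question, summary_answer in generate_questions(summary):
        source_answer = answer_question(question, source)
        judgments.append(SpanJudgment(
            question=question,
            summary_answer=summary_answer,
            source_answer=source_answer,
            factual=answers_match(summary_answer, source_answer),
        ))
    return judgments


if __name__ == "__main__":
    # Toy usage with stub components, just to show the data flow.
    stub_qg = lambda summ: [("Who won the award?", "Alice")]
    stub_qa = lambda q, src: "Bob"
    exact_match = lambda a, b: a.strip().lower() == b.strip().lower()
    for j in qa_factuality("Alice won the award.", "Bob won the award.", stub_qg, stub_qa, exact_match):
        print(j)
```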