Lack of factual correctness is an issue that still plagues state-of-the-art summarization systems despite their impressive progress in generating seemingly fluent summaries. In this paper, we show that factual inconsistency can be caused by irrelevant parts of the input text, which act as confounders. We leverage information-theoretic measures of causal effects to quantify the amount of confounding and precisely characterize how it affects summarization performance. Based on insights derived from our theoretical results, we design a simple multi-task model to control such confounding by leveraging human-annotated relevant sentences when available. Crucially, we give a principled characterization of data distributions where such confounding can be large, thereby necessitating the use of human-annotated relevant sentences to generate factual summaries. Our approach improves faithfulness scores by 20\% over strong baselines on AnswerSumm \citep{fabbri2021answersumm}, a conversation summarization dataset where lack of faithfulness is a significant issue due to the subjective nature of the task. Our best method achieves the highest faithfulness score while also attaining state-of-the-art results on standard metrics like ROUGE and METEOR. We corroborate these improvements through human evaluation.
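As a minimal illustration of the multi-task setup described above (the symbols $\mathcal{L}_{\text{summ}}$, $\mathcal{L}_{\text{rel}}$, and $\lambda$ are our notation for this sketch, not necessarily the model's exact objective), one plausible instantiation combines the standard summarization loss with an auxiliary loss for identifying the human-annotated relevant sentences:
\begin{equation*}
\mathcal{L} \;=\; \mathcal{L}_{\text{summ}} \;+\; \lambda\,\mathcal{L}_{\text{rel}},
\end{equation*}
where $\mathcal{L}_{\text{summ}}$ is the cross-entropy loss on the reference summary, $\mathcal{L}_{\text{rel}}$ is a per-sentence loss (e.g., binary cross-entropy) for predicting whether each input sentence is annotated as relevant, and $\lambda$ trades off the two tasks. Supervising relevance in this way is one way to discourage the model from conditioning on the irrelevant, confounding parts of the input.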