Neural abstractive summarization models are prone to generate summaries which are factually inconsistent with their source documents. Previous work has introduced the task of recognizing such factual inconsistency as a downstream application of natural language inference (NLI). However, state-of-the-art NLI models perform poorly in this context due to their inability to generalize to the target task. In this work, we show that NLI models can be effective for this task when the training data is augmented with high-quality task-oriented examples. We introduce Falsesum, a data generation pipeline leveraging a controllable text generation model to perturb human-annotated summaries, introducing varying types of factual inconsistencies. Unlike previously introduced document-level NLI datasets, our generated dataset contains examples that are diverse and inconsistent yet plausible. We show that models trained on a Falsesum-augmented NLI dataset improve the state-of-the-art performance across four benchmarks for detecting factual inconsistency in summarization. The code to obtain the dataset is available online at https://github.com/joshbambrick/Falsesum.
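To make the task setup concrete, the following is a minimal sketch (not the Falsesum pipeline itself) of how factual inconsistency detection can be cast as NLI: the source document serves as the premise and the candidate summary as the hypothesis, and a low entailment probability flags a potential inconsistency. The checkpoint name and example texts are illustrative assumptions; a Falsesum-augmented NLI model would be loaded the same way.

```python
# Sketch: scoring a summary against its source document with an NLI model.
# "roberta-large-mnli" is an assumed off-the-shelf checkpoint, not the
# paper's trained model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # assumption: any sequence-pair NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Illustrative inputs: premise = source document, hypothesis = summary.
document = "The company reported a quarterly loss of $2 million on Tuesday."
summary = "The company announced record quarterly profits."

# Long documents may require truncation or chunking in practice.
inputs = tokenizer(document, summary, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)

# Read the label mapping from the model config rather than hard-coding indices.
for idx, label in model.config.id2label.items():
    print(f"{label}: {probs[idx].item():.3f}")
# A low entailment probability marks the summary as potentially inconsistent.
```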