The propensity of abstractive summarization systems to make factual errors has been the subject of significant study, including work on models to detect factual errors and annotation of errors in current systems' outputs. However, the ever-evolving nature of summarization systems, error detectors, and annotated benchmarks makes factuality evaluation a moving target; it is hard to get a clear picture of how techniques compare. In this work, we collect labeled factuality errors from across nine datasets of annotated summary outputs and stratify them in a new way, focusing on what kind of base summarization model was used. To support finer-grained analysis, we unify the labeled error types into a single taxonomy and project each dataset's errors into this shared label space. We then contrast five state-of-the-art error detection methods on this benchmark. Our findings show that benchmarks built on modern summary outputs (those from pre-trained models) yield significantly different results than benchmarks built on pre-Transformer model outputs. Furthermore, no single factuality technique is superior across all settings or error types, suggesting that system developers should take care to choose the right detector for the task at hand.