The topic of summarization evaluation has recently attracted a surge of attention due to the rapid development of abstractive summarization systems. However, the formulation of the task is rather ambiguous: neither the linguistics community nor the natural language processing community has succeeded in producing a mutually agreed-upon definition. Owing to this lack of a well-defined formulation, many popular abstractive summarization datasets are constructed in a way that neither guarantees validity nor meets one of the most essential criteria of summarization: factual consistency. In this paper, we address this issue by combining state-of-the-art factual consistency models to identify problematic instances in popular summarization datasets. We release SummFC, a filtered summarization dataset with improved factual consistency, and demonstrate that models trained on it achieve improved performance across nearly all quality aspects. We argue that our dataset should become a valid benchmark for developing and evaluating summarization systems.
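As a rough illustration of the filtering idea described above (not the paper's actual pipeline, which combines multiple state-of-the-art factual consistency models), the sketch below scores each document–summary pair with a single off-the-shelf NLI model used as a proxy consistency scorer and drops low-scoring instances. The model choice (roberta-large-mnli) and the threshold value are assumptions for illustration only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: a generic NLI model stands in for the paper's ensemble of
# factual consistency models. Label order for roberta-large-mnli:
# 0 = contradiction, 1 = neutral, 2 = entailment.
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()

def consistency_score(document: str, summary: str) -> float:
    """Probability that the summary is entailed by the source document."""
    inputs = tokenizer(document, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return probs[2].item()  # entailment probability as the consistency proxy

def filter_dataset(pairs, threshold=0.5):
    """Keep (document, summary) pairs whose consistency score clears the
    (hypothetical) threshold; the rest are flagged as problematic instances."""
    return [(d, s) for d, s in pairs if consistency_score(d, s) >= threshold]
```

In practice, an ensemble of specialized factual consistency models and a calibrated decision rule would replace the single NLI scorer and fixed threshold used here.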