Modern summarization models generate highly fluent but often factually unreliable outputs. This motivated a surge of metrics attempting to measure the factuality of automatically generated summaries. Due to the lack of common benchmarks, these metrics cannot be compared. Moreover, all these methods treat factuality as a binary concept and fail to provide deeper insights into the kinds of inconsistencies made by different systems. To address these limitations, we devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets. Through these annotations, we identify the proportion of different categories of factual errors in various summarization models and benchmark factuality metrics, showing their correlation with human judgment as well as their specific strengths and weaknesses.