Current pre-trained models applied to summarization are prone to factual inconsistencies that either misrepresent the source text or introduce extraneous information. Comparing the factual consistency of summaries is therefore necessary as we develop improved models. However, the optimal human evaluation setup for factual consistency has not been standardized. To address this issue, we crowdsourced evaluations of factual consistency using two protocols, rating-based Likert scales and ranking-based Best-Worst Scaling, on summaries of 100 articles from each of the CNN-Daily Mail and XSum datasets generated by four state-of-the-art models, to determine the most reliable evaluation framework. We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design. Our crowdsourcing templates and summary evaluations will be publicly available to facilitate future research on factual consistency in summarization.
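To illustrate how a ranking-based protocol is typically aggregated, the sketch below computes standard Best-Worst Scaling scores, (#times chosen best − #times chosen worst) / #appearances, from crowdsourced judgments. This is a minimal sketch of the generic counting procedure, not the paper's implementation; the annotation format and system names (`bart`, `pegasus`, `t5`, `prophetnet`) are hypothetical.

```python
from collections import defaultdict

def best_worst_scores(annotations):
    """Aggregate Best-Worst Scaling judgments into per-system scores.

    Each annotation is a (best, worst, shown) triple: the summary picked
    as most factually consistent, the one picked as least consistent,
    and the full tuple of summaries shown in that comparison.
    """
    best_counts = defaultdict(int)
    worst_counts = defaultdict(int)
    appearances = defaultdict(int)
    for best, worst, shown in annotations:
        best_counts[best] += 1
        worst_counts[worst] += 1
        for item in shown:
            appearances[item] += 1
    # Standard BWS score: (#best - #worst) / #appearances, in [-1, 1].
    return {
        item: (best_counts[item] - worst_counts[item]) / appearances[item]
        for item in appearances
    }

# Example: three annotators each rank summaries from four systems.
judgments = [
    ("bart", "pegasus", ("bart", "pegasus", "t5", "prophetnet")),
    ("bart", "t5", ("bart", "pegasus", "t5", "prophetnet")),
    ("t5", "pegasus", ("bart", "pegasus", "t5", "prophetnet")),
]
print(best_worst_scores(judgments))
# -> approximately {'bart': 0.67, 'pegasus': -0.67, 't5': 0.0, 'prophetnet': 0.0}
```

Because each score is a difference of relative counts, systems are directly comparable without the per-annotator calibration issues that absolute Likert ratings can introduce.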