Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research: entailment-based metrics and question answering (QA)-based metrics. However, differing experimental setups presented in recent work lead to contrasting conclusions as to which paradigm performs best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that leads to a 15% average improvement over previous QA-based metrics on the SummaC factual consistency benchmark. Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance on this benchmark. Furthermore, we find that QA-based and entailment-based metrics offer complementary signals and combine the two into a single, learned metric for a further performance boost. Through qualitative and quantitative analyses, we point to question generation and answerability classification as two critical components for future work in QA-based metrics.
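To make the QA-based paradigm concrete, the sketch below outlines such a pipeline in Python. This is not the QAFactEval implementation: the question-generation step is a hypothetical `generate_questions` helper hard-coded for illustration, answering uses a generic extractive QA model via the Hugging Face `transformers` pipeline, answer overlap is measured with SQuAD-style token-level F1, and a simple confidence threshold stands in for a learned answerability classifier.

```python
from collections import Counter
from transformers import pipeline

# Generic extractive QA model; QAFactEval itself uses its own tuned components.
answerer = pipeline("question-answering", model="deepset/roberta-base-squad2")

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two answer strings (SQuAD-style overlap)."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def generate_questions(summary: str) -> list[tuple[str, str]]:
    """Hypothetical QG step: return (question, expected_answer) pairs whose
    expected answers are spans of the summary. A real system would use a
    trained question-generation model; hard-coded here for illustration."""
    return [("Who acquired the startup?", "Acme Corp")]

def qa_consistency_score(source: str, summary: str,
                         answerability_threshold: float = 0.1) -> float:
    """Average answer overlap of summary-derived questions answered from the
    source. Questions the QA model cannot confidently answer from the source
    score 0, a crude stand-in for an answerability classifier."""
    scores = []
    for question, expected in generate_questions(summary):
        result = answerer(question=question, context=source)
        if result["score"] < answerability_threshold:  # treat as unanswerable
            scores.append(0.0)
        else:
            scores.append(token_f1(result["answer"], expected))
    return sum(scores) / len(scores) if scores else 0.0

source = "Acme Corp announced on Tuesday that it had acquired the startup BetaWorks."
summary = "Acme Corp acquired BetaWorks."
print(f"consistency score: {qa_consistency_score(source, summary):.2f}")
```

Under this framing, the two components the abstract highlights map directly onto `generate_questions` and the answerability threshold, which is where the paper argues the largest quality differences between QA-based metrics arise.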