Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research: entailment-based and question answering (QA)-based metrics. However, different experimental setups often lead to contrasting conclusions as to which paradigm performs best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric, especially question generation and answerability classification, is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that leads to a 14% average improvement over previous QA-based metrics on the SummaC factual consistency benchmark, and also outperforms the best-performing entailment-based metric. Moreover, we find that QA-based and entailment-based metrics can offer complementary signals and be combined into a single metric for a further performance boost.
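To make the QA-based paradigm concrete, the sketch below walks through such a metric's pipeline: select answer candidates from the summary, generate a question for each candidate, answer the questions against the source document, and compare the answers. This is a minimal illustration under stated assumptions, not QAFactEval's actual implementation: the model checkpoints, question-generation prompt format, and token-overlap comparison are placeholders, whereas QAFactEval uses fine-tuned question generation, an answerability classifier, and a learned answer-overlap scorer (LERC).

```python
from transformers import pipeline

# Illustrative checkpoints only; QAFactEval fine-tunes its own QG and QA components.
question_generator = pipeline("text2text-generation", model="t5-base")
answerer = pipeline("question-answering", model="deepset/roberta-base-squad2")

def qa_consistency_score(source: str, summary: str) -> float:
    """Score how well `summary` is supported by `source` (1.0 = fully consistent)."""
    # 1. Crude answer-candidate selection: capitalized tokens stand in for the
    #    NER/noun-phrase extraction a real system would use.
    candidates = [tok.strip(".,") for tok in summary.split() if tok[:1].isupper()]
    if not candidates:
        return 1.0  # nothing to verify

    scores = []
    for answer in candidates:
        # 2. Generate a question whose answer (w.r.t. the summary) is the candidate.
        #    The prompt format is an assumption; it depends on the QG checkpoint.
        prompt = f"answer: {answer} context: {summary}"
        question = question_generator(prompt, max_length=64)[0]["generated_text"]

        # 3. Answer the question against the SOURCE document. The
        #    handle_impossible_answer flag is a crude stand-in for the
        #    answerability classification the abstract highlights.
        result = answerer(question=question, context=source,
                          handle_impossible_answer=True)
        if not result["answer"]:
            scores.append(0.0)  # unanswerable from the source: likely inconsistent
            continue

        # 4. Simple token-overlap answer comparison; QAFactEval instead uses a
        #    learned answer-overlap metric (LERC).
        pred = set(result["answer"].lower().split())
        gold = set(answer.lower().split())
        scores.append(len(pred & gold) / max(len(gold), 1))

    return sum(scores) / len(scores)
```

As the abstract notes, the choice of each component matters: swapping in a stronger question generator or a learned answer-overlap scorer, rather than the naive overlap above, is precisely the kind of component selection that drives the reported gains.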