While neural language models can generate text with remarkable fluency and coherence, controlling the factual correctness of generated text remains an open research question. This discrepancy between surface-level fluency and content-level correctness has motivated a new line of research seeking automatic metrics for evaluating the factuality of machine-generated text. In this paper, we introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. We propose five necessary and intuitive conditions for evaluating factuality metrics on diagnostic factuality data across three different summarization tasks. Our benchmark analysis of ten factuality metrics shows that our meta-evaluation framework provides a robust and efficient evaluation that is extensible to multiple types of factual-consistency and standard generation metrics, including QA metrics. It also reveals that while QA metrics generally improve over standard metrics at measuring factuality across domains, their performance is highly dependent on the way in which questions are generated.