The performance of text summarization has been greatly boosted by pre-trained language models. A main concern of existing methods is that most generated summaries are factually inconsistent with their source documents. To alleviate the problem, many efforts have focused on developing effective factuality evaluation metrics based on natural language inference, question answering, and syntactic dependency, among others. However, these approaches are limited by either their high computational complexity or the uncertainty introduced by multi-component pipelines, resulting in only partial agreement with human judgement. Most recently, large language models (LLMs) have shown excellent performance in not only text generation but also language comprehension. In this paper, we particularly explore ChatGPT's ability to evaluate factual inconsistency under a zero-shot setting by examining it on both coarse-grained and fine-grained evaluation tasks, including binary entailment inference, summary ranking, and consistency rating. Experimental results show that ChatGPT generally outperforms previous evaluation metrics across the three tasks, indicating its great potential for factual inconsistency evaluation. However, a closer inspection of ChatGPT's output reveals certain limitations, including a preference for more lexically similar candidates, false reasoning, and inadequate understanding of instructions.
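To make the zero-shot setup concrete, the sketch below illustrates how a binary entailment inference query could be posed to a ChatGPT-style model through the OpenAI chat API. The model name, prompt wording, and helper function are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of zero-shot binary entailment inference for factual
# consistency; the prompt text and model name are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def check_consistency(document: str, summary: str) -> str:
    """Ask the model whether the summary is factually consistent with the document."""
    prompt = (
        "Decide if the following summary is consistent with the corresponding article. "
        "Answer only yes or no.\n\n"
        f"Article: {document}\n\nSummary: {summary}\n\nAnswer:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output, suitable for evaluation
    )
    return response.choices[0].message.content.strip().lower()


# Example usage (hypothetical inputs):
# verdict = check_consistency(article_text, candidate_summary)  # "yes" or "no"
```

The same pattern extends to the other two tasks by changing the instruction: asking the model to pick the more consistent of two candidate summaries (summary ranking) or to output a numeric score (consistency rating).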