The performance of abstractive text summarization has recently been greatly boosted by pre-trained language models. A major concern with existing abstractive summarization methods is the factual inconsistency of their generated summaries. To alleviate this problem, many efforts have focused on developing effective factuality evaluation metrics based on natural language inference, question answering, and related techniques. However, these metrics suffer from high computational complexity and reliance on annotated data. Most recently, large language models such as ChatGPT have shown strong ability not only in natural language understanding but also in natural language inference. In this paper, we study the factual inconsistency evaluation ability of ChatGPT in the zero-shot setting by evaluating it on both coarse-grained and fine-grained factuality evaluation tasks, including binary natural language inference (NLI), summary ranking, and consistency rating. Experimental results show that ChatGPT outperforms previous SOTA evaluation metrics on six of the nine datasets across the three tasks, demonstrating its great potential for assessing factual inconsistency in the zero-shot setting. The results also highlight the importance of prompt design and the need for future work to address ChatGPT's limitations regarding evaluation bias, faulty reasoning, and hallucination.
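To make the zero-shot setup concrete, the sketch below shows one way a binary NLI-style consistency judgment could be elicited from ChatGPT. It is a minimal illustration assuming the OpenAI Python client; the model name, prompt wording, and yes/no answer parsing are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of zero-shot binary factual consistency checking with
# ChatGPT via the OpenAI Python client (openai>=1.0). The prompt wording
# is illustrative only, not the exact prompt studied in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def is_consistent(document: str, summary: str) -> bool:
    """Ask the model for a yes/no entailment judgment (binary NLI task)."""
    prompt = (
        "Decide if the following summary is consistent with the "
        "corresponding article. Note that consistency means all "
        "information in the summary is supported by the article.\n\n"
        f"Article: {document}\n"
        f"Summary: {summary}\n"
        "Answer (yes or no):"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; any chat model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for reproducible evaluation
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer.startswith("yes")
```

The summary ranking and consistency rating tasks follow the same pattern, differing only in the prompt (e.g., asking the model to pick the more consistent of two summaries, or to assign a rating on a fixed scale) and in how the free-text answer is parsed.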