Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric itself is still underexplored. Since assessing the quality of NLG models is an arduous task and previous statistical metrics are notorious for their poor correlation with human judgments, we ask whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation of ChatGPT to examine its reliability as an NLG metric. Specifically, we regard ChatGPT as a human evaluator and give it task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions to score the outputs of NLG models. We conduct experiments on three widely used NLG meta-evaluation datasets, covering summarization, story generation, and data-to-text tasks. Experimental results show that, compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with gold human judgments. We hope our preliminary study can prompt the emergence of a general-purpose, reliable NLG metric.
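To make the evaluation setup concrete, the sketch below shows how such an aspect-specific scoring prompt might be issued through the OpenAI chat-completion API. The model name, prompt wording, and the helper function `chatgpt_score` are illustrative assumptions for exposition, not the exact instructions used in this report.

```python
# A minimal sketch of aspect-specific NLG scoring with ChatGPT, assuming the
# OpenAI chat-completion API (openai<1.0 interface); the prompt text below is
# an illustrative paraphrase, not the report's exact instruction.
import openai

def chatgpt_score(source: str, output: str, aspect: str = "relevance") -> str:
    """Ask ChatGPT to rate one generated text on a single quality aspect."""
    prompt = (
        f"Score the following summary given the corresponding article "
        f"with respect to {aspect} on a scale of 1 to 5.\n\n"
        f"Article: {source}\n\nSummary: {output}\n\nScore:"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # the ChatGPT model exposed via the API
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding for more stable scores
    )
    return response["choices"][0]["message"]["content"].strip()
```

Under this setup, meta-evaluation amounts to collecting such scores over a benchmark's system outputs and computing their correlation (e.g., Spearman or Kendall) with the accompanying human judgments.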