Evaluating the quality of generated text is a challenging task in natural language processing, owing to the inherent complexity and diversity of text. Recently, OpenAI's ChatGPT, a powerful large language model (LLM), has attracted significant attention for its impressive performance across a wide range of tasks. We therefore present this report to investigate the effectiveness of LLMs, especially ChatGPT, and to explore how best to use them for assessing text quality. We compare three kinds of reference-free evaluation methods based on ChatGPT or similar LLMs. The experimental results show that ChatGPT can effectively evaluate text quality from various perspectives without any reference, and that it outperforms most existing automatic metrics. In particular, the Explicit Score, which prompts ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches explored. However, directly asking ChatGPT to compare the quality of two texts may lead to suboptimal results. We hope this report provides valuable insights into selecting appropriate methods for evaluating text quality with LLMs such as ChatGPT.
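As a concrete illustration, below is a minimal sketch of the Explicit Score idea: the LLM is prompted to return a single numeric quality score for a generated text, with no reference required. The prompt wording, the `explicit_score` helper, the 1-10 scale, and the use of the OpenAI Python SDK are assumptions made here for illustration; they are not the exact prompts or setup used in the report.

```python
# Minimal sketch of a reference-free "Explicit Score" evaluator.
# Assumes the OpenAI Python SDK (`pip install openai`) and an API key
# in the OPENAI_API_KEY environment variable. The prompt and the 1-10
# scale are illustrative assumptions, not the report's exact protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def explicit_score(source: str, generated: str, aspect: str = "fluency") -> float:
    """Ask the model for a 1-10 quality score along one aspect."""
    prompt = (
        f"Score the following generated text for {aspect} on a scale "
        f"from 1 (worst) to 10 (best). Reply with only the number.\n\n"
        f"Source:\n{source}\n\nGenerated text:\n{generated}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring for reproducibility
    )
    # The model is instructed to reply with a bare number, so parse directly.
    return float(response.choices[0].message.content.strip())


if __name__ == "__main__":
    print(explicit_score("A cat sat on a mat.", "The cat is sitting on the mat."))
```

Setting the temperature to zero is a natural design choice for evaluation, since it makes repeated scoring of the same text as consistent as the model allows.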