Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: A Preliminary Empirical Study

Evaluating the quality of generated text is a challenging task in natural language processing. This difficulty arises from the inherent complexity and diversity of text. Recently, OpenAI's ChatGPT, a powerful large language model (LLM), has garnered significant attention due to its impressive performance on various tasks. We therefore present this report to investigate the effectiveness of LLMs, especially ChatGPT, and to explore ways to optimize their use in assessing text quality. We compare three kinds of reference-free evaluation methods based on ChatGPT or similar LLMs. The experimental results show that ChatGPT can effectively evaluate text quality from various perspectives without references and that it outperforms most existing automatic metrics. In particular, the Explicit Score, which uses ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches explored. However, directly comparing the quality of two texts using ChatGPT may lead to suboptimal results. We hope this report will provide valuable insights into selecting appropriate methods for evaluating text quality with LLMs such as ChatGPT.

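To make the Explicit Score idea concrete, the following is a minimal sketch of the two steps it implies: building a prompt that asks the model for a numeric score, and parsing that score out of the reply. The prompt wording, the 1 to 5 scale, and the helper names `build_explicit_score_prompt` and `parse_score` are assumptions for illustration, not the prompts or code used in the study. The call to the LLM itself is deliberately left out.

```python
import re
from typing import Optional


def build_explicit_score_prompt(text: str, aspect: str,
                                low: int = 1, high: int = 5) -> str:
    """Build a reference-free scoring prompt (hypothetical wording)."""
    return (
        f"Score the following text for {aspect} on a scale from {low} to {high}, "
        f"where {low} is the worst and {high} is the best. "
        "Reply with the number only.\n\n"
        f"Text: {text}\n"
        "Score:"
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[float]:
    """Return the first number in the reply, or None if it is missing
    or falls outside the assumed [low, high] scale."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if low <= value <= high else None


if __name__ == "__main__":
    prompt = build_explicit_score_prompt("The cat sat on the mat.", "fluency")
    print(prompt)
    # The reply below is a stand-in for whatever the LLM returns.
    print(parse_score("Score: 4"))
```

Returning `None` for unparseable or out-of-range replies keeps malformed model outputs from being silently counted as scores; how such cases should be handled in practice (retry, skip, default) is a design choice this sketch leaves open.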