Text summarization has applications in many scenarios, but evaluating the quality of generated text is a complex problem. A major challenge is the clear divergence between existing automatic metrics and human evaluation. For example, human annotators can judge a document summary on objective aspects, such as grammatical and semantic correctness, as well as on subjective dimensions, such as comprehensiveness, succinctness, and interestingness. Most automatic evaluation methods, such as BLEU and ROUGE, may not capture these dimensions well. In this paper, we propose a new LLM-based evaluation framework that compares generated text with reference text on both objective and subjective aspects. First, we model the objective and subjective dimensions of generated text with a roleplayer prompting mechanism. Second, we introduce a context-based prompting mechanism that generates dynamic roleplayer profiles from the input context. Finally, we design a multi-roleplayer prompting technique based on batch prompting that integrates the individual evaluation results into an overall evaluation result. Experimental results on two real datasets for summarization show that our model is highly competitive and highly consistent with human annotators.
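
To make the multi-roleplayer idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each roleplayer casts one vote on whether the generated or the reference summary is better, that all roleplayers are packed into one prompt (batch prompting), and that votes are combined as the fraction preferring the generated summary. The `Roleplayer` class, the prompt wording, the reply format, and the aggregation rule are all hypothetical; the abstract does not specify them, and the LLM call itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Roleplayer:
    """A hypothetical evaluator persona; the fields are illustrative only."""
    name: str
    aspect: str       # "objective" or "subjective"
    description: str  # what this persona pays attention to


def build_batch_prompt(roleplayers, generated, reference):
    """Pack every roleplayer into a single prompt (batch prompting)."""
    lines = [
        "Two summaries of the same document are given.",
        f"Generated summary: {generated}",
        f"Reference summary: {reference}",
        "For each roleplayer below, answer on its own line as",
        "'<name>: generated' or '<name>: reference' to say which is better.",
        "Roleplayers:",
    ]
    for rp in roleplayers:
        lines.append(f"- {rp.name} ({rp.aspect}): {rp.description}")
    return "\n".join(lines)


def parse_votes(response, roleplayers):
    """Read one vote per known roleplayer; malformed lines are skipped."""
    known = {rp.name for rp in roleplayers}
    votes = {}
    for line in response.splitlines():
        name, sep, choice = line.partition(":")
        name, choice = name.strip(), choice.strip().lower()
        if sep and name in known and choice in ("generated", "reference"):
            votes[name] = choice
    return votes


def aggregate(votes):
    """Fraction of roleplayers preferring the generated summary (assumed rule)."""
    if not votes:
        return None
    preferred = sum(1 for choice in votes.values() if choice == "generated")
    return preferred / len(votes)
```

In use, the string returned by `build_batch_prompt` would be sent to an LLM of your choice, and the reply passed through `parse_votes` and `aggregate`. A score near 1 would mean most roleplayers preferred the generated summary over the reference.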