Text summarization has a wide range of applications in many scenarios. Evaluating the quality of generated text, however, is a complex problem. A major challenge in language evaluation is the clear divergence between existing automatic metrics and human judgment. For example, human annotators can assess the quality of a document summary along both objective aspects, such as grammatical and semantic correctness, and subjective dimensions, such as comprehensiveness, succinctness, and interestingness. Most automatic evaluation methods, such as BLEU and ROUGE, fail to capture these dimensions well. In this paper, we propose a new LLM-based evaluation framework that comprehensively compares generated text against reference text along both objective and subjective aspects. First, we propose to model the objective and subjective dimensions of generated text through a roleplayer prompting mechanism. Second, we introduce a context-based prompting mechanism that dynamically generates roleplayer profiles from the input context. Finally, we design a multi-roleplayer prompting technique based on batch prompting that integrates the individual roleplayers' judgments into a final evaluation result. Experimental results on two real-world summarization datasets show that our framework is highly competitive and agrees closely with human annotators.
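To make the multi-roleplayer idea concrete, the following is a minimal sketch of how several roleplayer instructions could be packed into one batch prompt and their scores integrated. The role names, the prompt template, and the aggregation rule (a simple mean) are illustrative assumptions for this sketch, not the paper's actual implementation; the LLM call itself is omitted.

```python
# Sketch: multi-roleplayer evaluation via batch prompting.
# Roles, template, and mean-aggregation are illustrative assumptions.
from statistics import mean

# Each roleplayer covers one objective or subjective dimension (assumed roles).
ROLES = {
    "grammar reviewer": "Rate grammatical and semantic correctness (1-5).",
    "general reader": "Rate comprehensiveness and succinctness (1-5).",
    "critic": "Rate interestingness (1-5).",
}

def build_batch_prompt(summary: str, reference: str) -> str:
    """Pack all roleplayer instructions into a single batch prompt."""
    tasks = "\n".join(
        f"[{i + 1}] As a {role}: {instruction}"
        for i, (role, instruction) in enumerate(ROLES.items())
    )
    return (
        f"Reference:\n{reference}\n\n"
        f"Generated summary:\n{summary}\n\n"
        f"Answer each task with a single score:\n{tasks}"
    )

def aggregate(scores: dict[str, float]) -> float:
    """Integrate the per-role scores into one evaluation result."""
    return mean(scores.values())

prompt = build_batch_prompt("A short summary.", "The reference text.")
final_score = aggregate({"grammar reviewer": 4, "general reader": 3, "critic": 5})
print(final_score)  # → 4
```

In practice the batch prompt would be sent to an LLM once, and the parsed per-role scores would replace the hard-coded dictionary above; batching all roles into one call is what amortizes the prompting cost across dimensions.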