Since powerful conversational models became available to a wide audience, users have started actively engaging in social interactions with this technology. Such unprecedented interaction experiences may pose considerable social and psychological risks to users unless the technology is properly controlled. This creates an urgent need for scalable and robust evaluation metrics for conversational chatbots. Existing automatic evaluation metrics usually focus on objective quality measures and disregard subjective perceptions of social dimensions. Moreover, most of these approaches operate on pre-produced dialogs from available benchmark corpora, which implies human involvement in preparing the material for evaluation and thus impedes the scalability of the metrics. To address this limitation, we propose to make use of emerging large language models (LLMs) from the GPT family and describe a new framework that allows dialog system evaluation to be conducted with prompting. With this framework, we achieve full automation of the evaluation pipeline and reach an impressive correlation with human judgment (up to Pearson r=0.95 at the system level). The underlying concept is to collect synthetic chat logs of the evaluated bots with an LLM in an other-play setting, where the LLM is carefully conditioned to follow a specific scenario. We further explore different prompting approaches to produce evaluation scores with the same LLM. The best-performing prompts, which contain few-shot demonstrations and instructions, show outstanding performance on the tested dataset and demonstrate the ability to generalize to other dialog corpora.
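To make the evaluation-by-prompting idea concrete, the following is a minimal sketch of how a collected chat log could be scored with a GPT-family model. It assumes the OpenAI chat completions API (openai>=1.0); the instruction wording, the few-shot demonstration, the 1-5 rating scale, and the model name are illustrative placeholders rather than the exact prompts used in this work.

```python
# Hypothetical sketch: prompt-based scoring of one synthetic chat log.
# Prompt layout, rating scale, and model name are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = (
    "You are an expert annotator. Rate the overall quality of the bot's "
    "responses in the dialog on a scale from 1 (poor) to 5 (excellent). "
    "Reply with a single number.\n\n"
)

FEW_SHOT_DEMO = (
    "Dialog:\n"
    "User: I had a rough day at work.\n"
    "Bot: That sounds hard. Do you want to talk about it?\n"
    "Rating (1-5): 5\n\n"
)

def score_dialog(chat_log: str, model: str = "gpt-3.5-turbo") -> float:
    """Ask the LLM for a quality score of one chat log and parse the number."""
    prompt = INSTRUCTION + FEW_SHOT_DEMO + f"Dialog:\n{chat_log}\nRating (1-5):"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic scoring
        max_tokens=3,
    )
    return float(response.choices[0].message.content.strip())

if __name__ == "__main__":
    log = "User: Hi!\nBot: Hello! How can I help you today?"
    print(score_dialog(log))
```

In the full pipeline, such per-dialog scores would be averaged over many synthetic other-play chat logs per bot to obtain system-level scores, which are then correlated with human judgments.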