Task-oriented dialogue systems aim to fulfill user goals through natural language interactions. They are ideally evaluated with human users, which is however infeasible at every iteration of the development phase. Simulated users could be an alternative, but their development is nontrivial. Researchers therefore resort to offline metrics on existing human-human corpora, which are more practical and easily reproducible. Unfortunately, such metrics only weakly reflect the real performance of dialogue systems: BLEU, for instance, is poorly correlated with human judgment, and existing corpus-based metrics such as success rate overlook dialogue context mismatches. There is still a need for a reliable metric for task-oriented systems that generalizes well and correlates strongly with human judgments. In this paper, we propose using offline reinforcement learning for dialogue evaluation based on a static corpus. Such an evaluator is typically called a critic and is utilized for policy optimization. We go one step further and show that offline RL critics can be trained on a static corpus as external evaluators for any dialogue system, allowing dialogue performance comparisons across various types of systems. This approach has the benefit of being corpus- and model-independent while attaining strong correlation with human judgments, which we confirm in an interactive user trial.
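To make the core idea concrete, below is a minimal sketch (not the paper's actual implementation) of how an offline RL critic could be trained on a static corpus of logged dialogue transitions and then used to score a dialogue. The state/action featurizations, the reward signal (e.g., task success observed at dialogue end), and all names here (`Critic`, `td_update`, `score_dialogue`) are hypothetical illustrations; the SARSA-style bootstrapping from the logged next action is one common choice in offline policy evaluation, assumed here to keep the update on-corpus.

```python
# Sketch of an offline RL critic for dialogue evaluation.
# Assumes the corpus provides featurized transitions (s, a, r, s', a', done);
# the reward r would typically encode task success at the final turn.

import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q-network over featurized dialogue state-action pairs."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def td_update(critic, target, batch, opt, gamma=0.99):
    """One temporal-difference step on a batch of logged transitions."""
    s, a, r, s2, a2, done = batch
    with torch.no_grad():
        # Bootstrap from the *logged* next action (SARSA-style), which keeps
        # evaluation on-corpus and avoids extrapolating to unseen actions --
        # an assumed design choice, not necessarily the paper's.
        y = r + gamma * (1.0 - done) * target(s2, a2)
    loss = nn.functional.mse_loss(critic(s, a), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def score_dialogue(critic, states, actions):
    """Critic-based score: Q at the first turn estimates the expected
    return (e.g., probability of task success) of the whole dialogue."""
    with torch.no_grad():
        return critic(states[0:1], actions[0:1]).item()
```

Because the critic scores state-action pairs rather than surface text, the same trained critic can in principle evaluate dialogues produced by any system operating over the same state/action representation, which is what enables the corpus- and model-independent comparisons described above.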