Reliable automatic evaluation of dialogue systems in an interactive environment is long overdue. An ideal environment for evaluating dialogue systems, also known as the Turing test, requires human interaction, which is usually prohibitively expensive for large-scale experiments. Though researchers have attempted to use metrics from language generation tasks (e.g., perplexity, BLEU) or model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods show only a weak correlation with actual human evaluation in practice. To bridge this gap, we propose a new framework named ENIGMA for estimating human evaluation scores based on recent advances in off-policy evaluation in reinforcement learning. ENIGMA requires only a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during evaluation, making automatic evaluation feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies used to collect the experience data (see details in Section 2), which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
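For intuition only, the sketch below illustrates the general idea of off-policy evaluation that ENIGMA builds on: estimating the expected human score of a target dialogue policy purely from pre-collected (logged) episodes, without interacting with the policy. This is a generic fitted-Q-evaluation-style routine, not the estimator proposed in the paper; all names (`Transition`, `target_policy`, the featurization, the per-turn reward convention) are hypothetical assumptions made for illustration.

```python
# Hypothetical sketch: off-policy evaluation of a dialogue policy from logged data
# via fitted-Q evaluation. This is NOT the ENIGMA estimator, only an illustration
# of estimating a policy's expected return (a proxy for its human score) offline.

from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np
from sklearn.linear_model import Ridge


@dataclass
class Transition:
    state: np.ndarray        # featurized dialogue context
    action: np.ndarray       # featurized system response
    reward: float            # per-turn reward; terminal turn carries the human score
    next_state: np.ndarray   # featurized context after the response
    done: bool


def fitted_q_evaluation(
    episodes: Sequence[List[Transition]],
    target_policy: Callable[[np.ndarray], np.ndarray],  # maps context -> response features
    gamma: float = 1.0,
    n_iterations: int = 20,
) -> float:
    """Estimate the target policy's expected return using only logged episodes."""
    transitions = [t for ep in episodes for t in ep]
    X = np.array([np.concatenate([t.state, t.action]) for t in transitions])
    rewards = np.array([t.reward for t in transitions])

    # Initialize Q-function to zero everywhere.
    q_model = Ridge(alpha=1.0).fit(X, np.zeros(len(transitions)))

    for _ in range(n_iterations):
        # Bellman targets under the *target* policy, computed from logged data only.
        next_q = np.array([
            0.0 if t.done else q_model.predict(
                np.concatenate([t.next_state, target_policy(t.next_state)])[None, :]
            )[0]
            for t in transitions
        ])
        q_model = Ridge(alpha=1.0).fit(X, rewards + gamma * next_q)

    # Average Q at initial dialogue states, with the first response chosen by the target policy.
    initial = [ep[0] for ep in episodes]
    X0 = np.array([np.concatenate([t.state, target_policy(t.state)]) for t in initial])
    return float(q_model.predict(X0).mean())
```

Because the Bellman targets are recomputed under the target policy while the regression data come entirely from the logged behavior policy, no new human interaction is needed at evaluation time, which is the property the abstract emphasizes.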