Reinforcement learning has been widely adopted to model dialogue managers in task-oriented dialogue. However, the user simulators provided by state-of-the-art dialogue frameworks are only rough approximations of human behaviour. The ability to learn from a small number of human interactions is hence crucial, especially in multi-domain and multi-task environments where the action space is large. We therefore propose structured policies to improve sample efficiency when learning in such environments, and we evaluate the impact of learning from human versus simulated experts. Among the different levels of structure we tested, graph neural networks (GNNs) show a remarkable superiority, reaching a success rate above 80% with only 50 dialogues when learning from simulated experts. They remain superior when learning from human experts, although with a performance drop that points to a difficulty in capturing the variability of human strategies. We thus suggest concentrating future research efforts on bridging the gap between human data, simulators, and automatic evaluators in dialogue frameworks.
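The abstract does not spell out the architecture, so the sketch below is only a rough illustration of what a graph-structured dialogue policy can look like, not the paper's actual design: all class and parameter names (`GNNPolicy`, `feat_dim`, the fully connected slot graph, the per-node action count) are hypothetical. It builds one node per dialogue slot, runs a single round of message passing with weights shared across nodes, and scores actions per node.

```python
import torch
import torch.nn as nn

class GNNPolicy(nn.Module):
    """Minimal message-passing policy: one graph node per dialogue slot/domain.
    Hypothetical sketch; not the architecture used in the paper."""

    def __init__(self, feat_dim: int, hidden_dim: int, actions_per_node: int):
        super().__init__()
        self.msg = nn.Linear(feat_dim, hidden_dim)                # message function, shared by all nodes
        self.upd = nn.Linear(feat_dim + hidden_dim, hidden_dim)   # node-update function
        self.out = nn.Linear(hidden_dim, actions_per_node)        # per-node action scores

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_nodes, feat_dim)  node features extracted from the belief state
        # adj: (num_nodes, num_nodes) row-normalised adjacency matrix of the slot graph
        m = adj @ torch.relu(self.msg(x))                         # aggregate neighbour messages
        h = torch.relu(self.upd(torch.cat([x, m], dim=-1)))       # update each node with its messages
        return self.out(h).flatten()                              # one logit per (node, action) pair

# Toy usage: 3 slot nodes, fully connected graph, 4 summary actions per node.
x = torch.randn(3, 8)
adj = torch.full((3, 3), 1.0 / 3)
policy = GNNPolicy(feat_dim=8, hidden_dim=16, actions_per_node=4)
action = torch.distributions.Categorical(logits=policy(x, adj)).sample()
```

Because the same `msg`, `upd`, and `out` weights are applied to every node, the parameter count is independent of the number of slots and experience gathered on one slot informs the others, which is one plausible reading of why such structured policies need fewer dialogues than a flat policy over the full action space.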