Task-Oriented Dialogue (TOD) systems have been drawing increasing attention in recent studies. Current methods focus on constructing pre-trained models or fine-tuning strategies, while the evaluation of TOD systems is limited by a policy mismatch problem: during evaluation, the user utterances are taken from the annotated dataset, whereas they should respond to the previously generated system responses, which can have many alternatives beyond the annotated texts. In this work, we therefore propose an interactive evaluation framework for TOD. We first build a goal-oriented user simulator based on pre-trained models and then let the user simulator interact with the dialogue system to generate dialogues. In addition, we introduce a sentence-level score and a session-level score to measure sentence fluency and session coherence in the interactive evaluation. Experimental results show that RL-based TOD systems trained with our proposed user simulator achieve nearly 98% inform and success rates in the interactive evaluation on the MultiWOZ dataset, and that the proposed scores capture response quality beyond inform and success rates. We hope our work will encourage simulator-based interactive evaluation for the TOD task.
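To make the interactive setup concrete, the following is a minimal sketch of a simulator-based evaluation loop of the kind described above. It assumes hypothetical `UserSimulator` and `DialogueSystem` interfaces with the method names shown; these names and the turn-level bookkeeping are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of an interactive evaluation loop: the user simulator conditions on
# the goal and on *generated* system responses rather than annotated ones,
# which is what avoids the policy mismatch problem described above.
# UserSimulator/DialogueSystem and their methods are hypothetical.

def interactive_evaluation(simulator, system, goal, max_turns=20):
    """Run one goal-driven dialogue and report inform/success outcomes."""
    dialogue = []
    system_response = ""  # the system replies after the first user turn
    for _ in range(max_turns):
        # Simulator produces the next user utterance given the goal and the
        # system's previous (generated) response.
        user_utterance = simulator.next_utterance(goal, system_response)
        if simulator.is_finished():
            break
        system_response = system.respond(user_utterance)
        dialogue.append((user_utterance, system_response))
    return {
        "dialogue": dialogue,
        "inform": simulator.all_requested_slots_informed(goal),
        "success": simulator.goal_completed(goal),
    }
```

Sentence-level fluency and session-level coherence scores would then be computed over the generated `dialogue`, complementing the inform and success rates.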