Evaluation is crucial in the development process of task-oriented dialogue systems. As an evaluation method, user simulation allows us to tackle issues such as scalability and cost-efficiency, making it a viable choice for large-scale automatic evaluation. To help build a human-like user simulator that can measure the quality of a dialogue, we propose the following task: simulating user satisfaction for the evaluation of task-oriented dialogue systems. The purpose of the task is to increase the evaluation power of user simulations and to make the simulation more human-like. To overcome a lack of annotated data, we propose a user satisfaction annotation dataset, USS, that includes 6,800 dialogues sampled from multiple domains, spanning real-world e-commerce dialogues, task-oriented dialogues constructed through Wizard-of-Oz experiments, and movie recommendation dialogues. All user utterances in those dialogues, as well as the dialogues themselves, have been labeled based on a 5-level satisfaction scale. We also share three baseline methods for user satisfaction prediction and action prediction tasks. Experiments conducted on the USS dataset suggest that distributed representations outperform feature-based methods. A model based on hierarchical GRUs achieves the best performance in in-domain user satisfaction prediction, while a BERT-based model has better cross-domain generalization ability.
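The abstract's strongest in-domain baseline is a hierarchical-GRU model for 5-level user satisfaction prediction. The sketch below is not the authors' implementation; it is a minimal, hedged illustration of what such a hierarchical classifier could look like in PyTorch, where a word-level GRU encodes each utterance and a dialogue-level GRU encodes the utterance sequence before a 5-way classification head. The class name, hyperparameters, and vocabulary size are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a hierarchical GRU classifier for
# 5-level user satisfaction prediction: a word-level GRU encodes each utterance,
# a dialogue-level GRU runs over the utterance vectors, and a linear head
# scores the dialogue context. Names and sizes are assumptions.
import torch
import torch.nn as nn


class HierarchicalGRUClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, utt_hidden=128,
                 dlg_hidden=128, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Utterance-level encoder: reads the tokens of one utterance.
        self.utterance_gru = nn.GRU(emb_dim, utt_hidden, batch_first=True)
        # Dialogue-level encoder: reads the sequence of utterance vectors.
        self.dialogue_gru = nn.GRU(utt_hidden, dlg_hidden, batch_first=True)
        self.classifier = nn.Linear(dlg_hidden, num_classes)

    def forward(self, dialogue_tokens):
        # dialogue_tokens: (batch, num_utterances, max_tokens) of token ids.
        batch, num_utt, max_tok = dialogue_tokens.shape
        flat = dialogue_tokens.view(batch * num_utt, max_tok)
        embedded = self.embedding(flat)                # (B*U, T, E)
        _, utt_state = self.utterance_gru(embedded)    # (1, B*U, H)
        utt_vecs = utt_state.squeeze(0).view(batch, num_utt, -1)
        _, dlg_state = self.dialogue_gru(utt_vecs)     # (1, B, H)
        return self.classifier(dlg_state.squeeze(0))   # (B, 5) satisfaction logits


# Toy usage: a batch of 2 dialogues, each with 4 utterances of up to 6 tokens.
model = HierarchicalGRUClassifier(vocab_size=10000)
dummy = torch.randint(1, 10000, (2, 4, 6))
print(model(dummy).shape)  # torch.Size([2, 5])
```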