Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation. Such evaluation is often limited to single-turn interactions or is very time-intensive. As an alternative, user simulators that mimic user behavior allow us to consider a broad set of user goals and to generate human-like conversations for simulated evaluation. Employing existing user simulators to evaluate TDSs is challenging, as they are primarily designed to optimize dialogue policies for TDSs and have limited evaluation capabilities. Moreover, the evaluation of user simulators is itself an open challenge. In this work, we propose a metaphorical user simulator for end-to-end TDS evaluation, where we define a simulator to be metaphorical if it simulates a user's analogical thinking in interactions with systems. We also propose a tester-based evaluation framework to generate variants, i.e., dialogue systems with different capabilities. Our user simulator constructs a metaphorical user model that assists the simulator in reasoning by referring to prior knowledge when encountering new items. We estimate the quality of simulators by checking the simulated interactions between simulators and variants. Our experiments are conducted using three TDS datasets. The metaphorical user simulator demonstrates better consistency with manual evaluation than an agenda-based simulator and a seq2seq model on the three datasets; our tester framework demonstrates efficiency, as well as better generalization and scalability, because it can be adapted for dialogues in multiple domains and for multiple tasks, such as conversational recommendation and e-commerce dialogues.