Dialogue policy optimisation via reinforcement learning requires a large number of training interactions, which makes learning with real users time consuming and expensive. Many set-ups therefore rely on a user simulator instead of humans. These user simulators have their own problems. While hand-coded, rule-based user simulators have been shown to be sufficient in small, simple domains, for complex domains the number of rules quickly becomes intractable. State-of-the-art data-driven user simulators, on the other hand, are still domain-dependent. This means that adaptation to each new domain requires redesigning and retraining. In this work, we propose a domain-independent transformer-based user simulator (TUS). The structure of our TUS is not tied to a specific domain, enabling domain generalisation and learning of cross-domain user behaviour from data. We compare TUS with the state of the art using automatic as well as human evaluations. TUS can compete with rule-based user simulators on pre-defined domains and is able to generalise to unseen domains in a zero-shot fashion.