Learning task-oriented dialog policies via reinforcement learning typically requires large amounts of interaction with users, which in practice renders such methods unusable for real-world applications. To reduce these data requirements, we propose to leverage data from across different dialog domains, thereby reducing the amount of data required from each individual domain. In particular, we propose to learn domain-agnostic action embeddings, which capture general-purpose structure that informs the system how to act given the current dialog context, and which are then specialized to a specific domain. On a set of simulated domains, we show that this approach learns with significantly less user interaction, reducing the number of dialogs required to learn by 35%, and reaches a higher level of proficiency than training a separate policy for each domain.
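As a rough illustration of the idea of a shared, domain-agnostic action embedding that is then specialized per domain, the following minimal sketch maps a dialog-state vector to a shared embedding and decodes it with a separate linear head for each domain. Everything in it is an assumption made for illustration and is not taken from the paper: the class name `SharedEmbeddingPolicy`, the layer shapes, the `tanh` nonlinearity, and the example domains and action names are hypothetical. The reinforcement-learning update is omitted entirely; only the forward pass is shown.

```python
import math
import random


def linear(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random initial weights (illustrative only, untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SharedEmbeddingPolicy:
    """Hypothetical policy: one shared embedding layer, one head per domain."""

    def __init__(self, state_dim, embed_dim, domain_actions, seed=0):
        rng = random.Random(seed)
        # Shared across all domains: dialog state -> action embedding.
        self.shared = random_matrix(embed_dim, state_dim, rng)
        # Domain-specific: action embedding -> scores over that domain's actions.
        self.heads = {
            domain: random_matrix(len(actions), embed_dim, rng)
            for domain, actions in domain_actions.items()
        }
        self.actions = domain_actions

    def embed(self, state):
        """Domain-agnostic embedding of the current dialog state."""
        return [math.tanh(h) for h in linear(self.shared, state)]

    def action_probs(self, domain, state):
        """Distribution over the given domain's actions."""
        return softmax(linear(self.heads[domain], self.embed(state)))

    def act(self, domain, state):
        """Greedy action choice for the given domain."""
        probs = self.action_probs(domain, state)
        best = max(range(len(probs)), key=probs.__getitem__)
        return self.actions[domain][best]


# Hypothetical domains and action names, for illustration only.
policy = SharedEmbeddingPolicy(
    state_dim=4,
    embed_dim=3,
    domain_actions={
        "restaurant": ["request_cuisine", "offer_venue", "confirm_booking"],
        "taxi": ["request_destination", "confirm_ride"],
    },
)
state = [1.0, 0.0, 0.5, 0.0]
restaurant_probs = policy.action_probs("restaurant", state)
taxi_action = policy.act("taxi", state)
```

In a design like this, only the per-domain heads depend on a domain's action set, so experience from any domain can in principle update the shared layer. Whether the paper's model is structured this way is not stated in the abstract.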