Utilizing amortized variational inference for latent-action reinforcement learning (RL) has been shown to be an effective approach in task-oriented dialogue (ToD) systems for optimizing dialogue success. Until now, categorical posteriors have been argued to be one of the main drivers of performance. In this work we revisit Gaussian variational posteriors for latent-action RL and show that they can yield even better performance than categoricals. We achieve this by simplifying the training procedure and proposing ways to regularize the latent dialogue policy so that it retains good response coherence. Using continuous latent representations, our model achieves state-of-the-art dialogue success rate on the MultiWOZ benchmark, while also comparing well to categorical latent methods in response coherence.
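As a point of reference, a minimal sketch of the core ingredient mentioned above, an amortized Gaussian posterior over a continuous latent action with a KL regularizer toward a standard-normal prior, might look as follows. This is not the authors' implementation; the module, dimensions, and the choice of a standard-normal prior are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's code): an amortized Gaussian
# posterior over a continuous latent action z, with the reparameterization
# trick and a KL term toward a standard-normal prior used as regularization.
import torch
import torch.nn as nn

class GaussianLatentPolicy(nn.Module):
    def __init__(self, ctx_dim: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Linear(ctx_dim, latent_dim)       # posterior mean
        self.log_var = nn.Linear(ctx_dim, latent_dim)  # posterior log-variance

    def forward(self, ctx: torch.Tensor):
        mu, log_var = self.mu(ctx), self.log_var(ctx)
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)           # reparameterized sample
        # KL( q(z|ctx) || N(0, I) ), a common regularizer for the latent policy
        kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - 1.0 - log_var, dim=-1)
        return z, kl

# Usage: encode the dialogue context, sample a latent action, pass z to a decoder.
policy = GaussianLatentPolicy(ctx_dim=256, latent_dim=32)
ctx = torch.randn(4, 256)   # placeholder dialogue-context encodings
z, kl = policy(ctx)
```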