End-to-end task bots are typically trained on a static and often limited-size corpus. When deployed in dynamic, changing, and open environments to interact with users, however, task bots tend to fail on data that deviate from the training corpus, i.e., out-of-distribution samples. In this paper, we study the problem of automatically adapting task bots to changing environments by learning from human-bot interactions with minimal or zero human annotation. We propose SL-AGENT, a novel self-learning framework for building end-to-end task bots. SL-AGENT consists of a dialog model and a pre-trained reward model that predicts the quality of an agent response. It enables task bots to adapt automatically to changing environments via reinforcement learning, guided by the incorporated reward model, on the unlabeled human-bot dialog logs accumulated after deployment. Experimental results on four well-studied dialog tasks demonstrate, through both automatic and human evaluations, the effectiveness of SL-AGENT in automatically adapting to changing environments. We will release the code and data for further research.
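To make the self-learning loop concrete, the sketch below shows one plausible REINFORCE-style update in PyTorch: the dialog model samples a response for a logged dialog context, a frozen reward model scores the (context, response) pair, and the reward weights the log-likelihood of the sampled tokens. The `reward_model(context, response)` interface, the generation settings, and all function names here are illustrative assumptions, not the released SL-AGENT implementation.

```python
# Minimal sketch of reward-guided self-learning on unlabeled dialog logs.
# Assumes a Hugging Face-style causal-LM dialog model and tokenizer, and a
# hypothetical reward model that maps (context, response) text to a scalar.
import torch
import torch.nn.functional as F

def self_learning_step(dialog_model, reward_model, tokenizer, contexts, optimizer):
    """One REINFORCE-style update over a batch of logged dialog contexts."""
    optimizer.zero_grad()
    total_loss = 0.0
    for context in contexts:
        # Sample a response from the current dialog model.
        input_ids = tokenizer(context, return_tensors="pt").input_ids
        output_ids = dialog_model.generate(
            input_ids, do_sample=True, max_new_tokens=40
        )
        response_ids = output_ids[0, input_ids.size(1):]

        # Score the sampled response with the frozen reward model
        # (hypothetical interface: text in, scalar reward out).
        with torch.no_grad():
            reward = reward_model(context, tokenizer.decode(response_ids))

        # Policy-gradient loss: reward-weighted log-likelihood of the sample.
        full_ids = torch.cat([input_ids[0], response_ids]).unsqueeze(0)
        logits = dialog_model(full_ids).logits[0, input_ids.size(1) - 1:-1]
        log_probs = F.log_softmax(logits, dim=-1)
        token_logp = log_probs.gather(1, response_ids.unsqueeze(1)).sum()
        total_loss = total_loss - reward * token_logp

    (total_loss / len(contexts)).backward()
    optimizer.step()
```

Because the reward model, rather than human labels, supplies the learning signal, this update can run directly on dialog logs accumulated after deployment, which is what allows the bot to adapt without new annotation.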