This paper focuses on robotic reinforcement learning with sparse rewards for natural language goal representations. An open problem is the sample inefficiency that stems from the compositionality of natural language, and from the grounding of language in sensory data and actions. We address these issues with three contributions. We first present a mechanism for hindsight instruction replay utilizing expert feedback. Second, we propose a seq2seq model to generate linguistic hindsight instructions. Finally, we present a novel class of language-focused learning tasks. We show that hindsight instructions improve learning performance, as expected. In addition, we also provide an unexpected result: we show that the learning performance of our agent can be improved by one third if, in a sense, the agent learns to talk to itself in a self-supervised manner. We achieve this by learning to generate linguistic instructions that would have been appropriate as a natural language goal for an originally unintended behavior. Our results indicate that the performance gain increases with task complexity.
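To make the relabeling idea behind hindsight instruction replay concrete, the following is a minimal, hypothetical Python sketch; it is not the authors' implementation. The names `Transition`, `instruction_generator`, and `buffer` are illustrative assumptions standing in for the agent's replay memory and for the expert- or seq2seq-based instruction source described above.

```python
# Minimal sketch of hindsight instruction replay (illustrative, not the paper's code).
from dataclasses import dataclass
from typing import List, Optional, Callable

@dataclass
class Transition:
    observation: list     # raw sensory observation at this step
    action: list          # action taken by the agent
    instruction: str      # natural language goal the agent was pursuing
    reward: float         # sparse reward (1.0 only when the goal is achieved)

def hindsight_instruction_replay(
    episode: List[Transition],
    instruction_generator: Callable[[List[Transition]], Optional[str]],
    buffer,
) -> None:
    """Relabel a (possibly failed) episode with an instruction matching the
    behavior actually shown, and store it as a successful example."""
    # Ask the instruction source (expert feedback or a trained seq2seq model)
    # to describe what the trajectory actually accomplished.
    hindsight_instruction = instruction_generator(episode)
    if hindsight_instruction is None:
        return  # no recognizable outcome; keep only the original episode
    for t in episode:
        # Store a relabeled copy: same observations and actions, but paired
        # with the hindsight instruction and a success signal at the end.
        buffer.add(Transition(
            observation=t.observation,
            action=t.action,
            instruction=hindsight_instruction,
            reward=1.0 if t is episode[-1] else 0.0,
        ))
```

In this reading, "learning to talk to itself" amounts to replacing the expert with the agent's own seq2seq instruction generator, so the relabeled episodes become self-supervised training signal.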