Obtaining policies that can generalise to new environments in reinforcement learning is challenging. In this work, we demonstrate that language understanding via a reading policy learner is a promising vehicle for generalisation to new environments. We propose a grounded policy learning problem, Read to Fight Monsters (RTFM), in which the agent must jointly reason over a language goal, relevant dynamics described in a document, and environment observations. We procedurally generate environment dynamics and corresponding language descriptions of the dynamics, such that agents must read to understand new environment dynamics instead of memorising any particular information. In addition, we propose txt2$\pi$, a model that captures three-way interactions between the goal, document, and observations. On RTFM, txt2$\pi$ generalises to new environments with dynamics not seen during training via reading. Furthermore, our model outperforms baselines such as FiLM and language-conditioned CNNs on RTFM. Through curriculum learning, txt2$\pi$ produces policies that excel on complex RTFM tasks requiring several reasoning and coreference steps.