Despite recent breakthroughs in reinforcement learning (RL) and imitation learning (IL), existing algorithms fail to generalize beyond the training environments. In contrast, humans can adapt to new tasks quickly by leveraging prior knowledge about the world, such as language descriptions. To facilitate research on language-guided agents with domain adaptation, we propose a novel zero-shot compositional policy learning task, where the environments are characterized as a composition of different attributes. Since no public environment supports this study, we introduce a new research platform, BabyAI++, in which the dynamics of environments are disentangled from visual appearance. In each episode, BabyAI++ provides varied vision-dynamics combinations along with corresponding descriptive texts. To evaluate the adaptation capability of learned agents, a set of vision-dynamics pairings is held out for testing on BabyAI++. Unsurprisingly, we find that current language-guided RL/IL techniques overfit to the training environments and suffer a substantial performance drop when facing unseen combinations. In response, we propose a multi-modal fusion method with an attention mechanism to perform visual language grounding. Extensive experiments provide strong evidence that language grounding improves the generalization of agents across environments with varied dynamics.
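For intuition, the sketch below shows one way an attention-based multi-modal fusion for visual language grounding could be implemented: each spatial location of a visual feature map attends over per-word embeddings of the descriptive text, and the resulting language context is concatenated back onto the visual features before the policy head. All module names, dimensions, and the specific attention formulation are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of attention-based visual-language fusion.
# Names and dimensions are illustrative, not the paper's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGrounding(nn.Module):
    """Fuse a visual feature map with word embeddings via spatial attention."""

    def __init__(self, vis_channels: int, lang_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Project visual and language features into a shared space.
        self.vis_proj = nn.Conv2d(vis_channels, hidden_dim, kernel_size=1)
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)

    def forward(self, vis_feat: torch.Tensor, word_embs: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) feature map from a CNN observation encoder.
        # word_embs: (B, T, D) per-word embeddings from a language encoder.
        B, _, H, W = vis_feat.shape
        v = self.vis_proj(vis_feat)                              # (B, hidden, H, W)
        q = v.flatten(2).transpose(1, 2)                         # (B, H*W, hidden)
        k = self.lang_proj(word_embs)                            # (B, T, hidden)
        # Each spatial location attends over the words of the description.
        attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        grounded = (attn @ k).transpose(1, 2).reshape(B, -1, H, W)
        # Concatenate grounded language context with the visual features.
        return torch.cat([v, grounded], dim=1)


if __name__ == "__main__":
    # Example: a 7x7 grid observation and a 10-word dynamics description.
    fusion = AttentionGrounding(vis_channels=64, lang_dim=32)
    vis = torch.randn(2, 64, 7, 7)
    words = torch.randn(2, 10, 32)
    print(fusion(vis, words).shape)  # torch.Size([2, 256, 7, 7])
```

The fused feature map would then feed the actor-critic (or imitation) policy head, so the policy can condition its behavior on which dynamics the text describes rather than on visual appearance alone.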