Reinforcement learning from large-scale offline datasets provides us with the ability to learn policies without potentially unsafe or impractical exploration. Significant progress has been made in the past few years in dealing with the challenge of correcting for the mismatch between the data-collection policy and the learned policy. However, little attention has been paid to potentially changing dynamics when transferring a policy to the online setting, where performance for existing methods can drop by up to 90%. In this paper we address this problem with Augmented World Models (AugWM). We augment a learned dynamics model with simple transformations that seek to capture potential changes in the physical properties of the robot, leading to more robust policies. We not only train our policy in this new setting, but also provide it with the sampled augmentation as a context, allowing it to adapt to changes in the environment. At test time we learn the context in a self-supervised fashion by approximating the augmentation that corresponds to the new environment. We rigorously evaluate our approach on over 100 different changed-dynamics settings, and show that this simple approach can significantly improve the zero-shot generalization of a recent state-of-the-art baseline, often producing successful policies where the baseline fails.
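To make the idea concrete, the sketch below illustrates one plausible instantiation of the augmentation and the test-time context estimate described above, using NumPy only. The specific transformation (an elementwise scale and shift of the predicted state delta), the uniform sampling ranges, and the least-squares estimator `estimate_context` are illustrative assumptions for this sketch, not the paper's implementation.

```python
# Minimal sketch (assumed, not the paper's code): perturb world-model transitions
# with a random scale/shift of the state delta, and recover that perturbation as a
# context at test time from observed transitions in the new environment.
import numpy as np

rng = np.random.default_rng(0)

def sample_augmentation(state_dim, scale_range=(-0.3, 0.3), shift_range=(-0.1, 0.1)):
    """Sample a per-dimension scale/shift perturbation acting on the state delta."""
    scale = rng.uniform(*scale_range, size=state_dim)
    shift = rng.uniform(*shift_range, size=state_dim)
    return scale, shift

def augment_transition(s_t, s_next_pred, scale, shift):
    """Perturb a world-model transition: keep s_t, warp the predicted delta."""
    delta = s_next_pred - s_t
    return s_t + (1.0 + scale) * delta + shift

def estimate_context(s_t, s_next_obs, s_next_pred):
    """Self-supervised context estimate: per-dimension least-squares fit of the
    scale/shift that maps model-predicted deltas to deltas observed at test time."""
    pred_delta = s_next_pred - s_t          # (N, D) deltas predicted by the model
    obs_delta = s_next_obs - s_t            # (N, D) deltas seen in the new environment
    scale_hat = np.empty(pred_delta.shape[1])
    shift_hat = np.empty(pred_delta.shape[1])
    for d in range(pred_delta.shape[1]):
        # Solve obs_delta ~= (1 + scale) * pred_delta + shift for each state dimension.
        A = np.stack([pred_delta[:, d], np.ones(len(pred_delta))], axis=1)
        coef, _, _, _ = np.linalg.lstsq(A, obs_delta[:, d], rcond=None)
        scale_hat[d], shift_hat[d] = coef[0] - 1.0, coef[1]
    return scale_hat, shift_hat

# Toy usage: augment a batch of model-rollout transitions, then recover the context.
D, N = 4, 256
s_t = rng.normal(size=(N, D))
s_next_pred = s_t + rng.normal(scale=0.1, size=(N, D))   # stand-in for model predictions
scale, shift = sample_augmentation(D)
s_next_aug = augment_transition(s_t, s_next_pred, scale, shift)
scale_hat, shift_hat = estimate_context(s_t, s_next_aug, s_next_pred)
print(np.allclose(scale, scale_hat), np.allclose(shift, shift_hat))
```

In this sketch the sampled `(scale, shift)` pair would be concatenated to the policy's observation as the context during training, and the estimated `(scale_hat, shift_hat)` would replace it at test time.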