Learning generalizable policies that can adapt to unseen environments remains challenging in visual Reinforcement Learning (RL). Existing approaches try to acquire a robust representation by diversifying the appearances of in-domain observations for better generalization. Limited by the specific observations of the environment, these methods ignore the possibility of exploring diverse real-world image datasets. In this paper, we investigate how a visual RL agent would benefit from off-the-shelf visual representations. Surprisingly, we find that the early layers of an ImageNet pre-trained ResNet model can provide rather generalizable representations for visual RL. Hence, we propose Pre-trained Image Encoder for Generalizable visual reinforcement learning (PIE-G), a simple yet effective framework that can generalize to unseen visual scenarios in a zero-shot manner. Extensive experiments are conducted on the DMControl Generalization Benchmark, DMControl Manipulation Tasks, Drawer World, and CARLA to verify the effectiveness of PIE-G. Empirical evidence suggests PIE-G improves sample efficiency and significantly outperforms previous state-of-the-art methods in terms of generalization performance. In particular, PIE-G boasts a 55% generalization performance gain on average in the challenging video background setting. Project Page: https://sites.google.com/view/pie-g/home.
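To make the core idea concrete, the sketch below illustrates one way to use the early layers of an ImageNet pre-trained ResNet as a frozen observation encoder for an RL agent. It is a minimal illustration under assumed choices (ResNet-18, a cutoff after the second residual stage, and an 84x84 observation size), not the authors' released implementation.

```python
# Minimal sketch: frozen early layers of an ImageNet pre-trained ResNet
# as a visual RL observation encoder (assumed cutoff and input size).
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet pre-trained ResNet-18 and keep only its early layers
# (stem plus the first two residual stages, chosen here as an assumption).
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
early_layers = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2,
)

# Freeze the encoder so RL gradients never update the pre-trained weights.
for p in early_layers.parameters():
    p.requires_grad = False
early_layers.eval()

# Encode a batch of hypothetical 84x84 RGB observations into flat features
# that a downstream policy/value head could consume.
obs = torch.rand(8, 3, 84, 84)            # placeholder observation batch
with torch.no_grad():
    feats = early_layers(obs)             # shape (8, 128, 11, 11) at this cutoff
features = feats.flatten(start_dim=1)     # one flat feature vector per observation
print(features.shape)
```

Because the encoder is frozen, only the lightweight policy and value heads are trained by the RL objective, which is what allows the same visual features to be reused zero-shot in unseen environments.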