Recent unsupervised pre-training methods have been shown to be effective in the language and vision domains by learning representations useful for multiple downstream tasks. In this paper, we investigate whether such unsupervised pre-training methods can also be effective for vision-based reinforcement learning (RL). To this end, we introduce a framework that learns representations useful for understanding dynamics via generative pre-training on videos. Our framework consists of two phases: we pre-train an action-free latent video prediction model, and then utilize the pre-trained representations to efficiently learn action-conditional world models on unseen environments. To incorporate additional action inputs during fine-tuning, we introduce a new architecture that stacks an action-conditional latent prediction model on top of the pre-trained action-free prediction model. Moreover, to encourage better exploration, we propose a video-based intrinsic bonus that leverages the pre-trained representations. We demonstrate that our framework significantly improves both the final performance and sample efficiency of vision-based RL in a variety of manipulation and locomotion tasks. Code is available at https://github.com/younggyoseo/apv.
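As a rough illustration of the stacked architecture described above, the sketch below shows an action-free recurrent prediction model whose latent states are consumed, together with actions, by an action-conditional model stacked on top. This is not the authors' implementation: the GRU cells, flattened-image encoder, and all module names and sizes are illustrative assumptions standing in for the paper's latent video prediction models.

```python
# Minimal sketch (assumed components, not the released APV code) of stacking
# an action-conditional latent model on top of a pre-trained action-free one.
import torch
import torch.nn as nn

class ActionFreeModel(nn.Module):
    """Pre-trained on action-free videos: updates its latent state from the
    current image embedding alone."""
    def __init__(self, embed_dim=256, latent_dim=128):
        super().__init__()
        self.encoder = nn.Linear(64 * 64 * 3, embed_dim)  # stand-in for a conv encoder
        self.cell = nn.GRUCell(embed_dim, latent_dim)

    def step(self, image, h):
        embed = self.encoder(image.flatten(1))
        return self.cell(embed, h)

class ActionConditionalModel(nn.Module):
    """Stacked on top during fine-tuning: consumes the action-free latent
    together with the action, so pre-trained dynamics knowledge is reused
    while the new action inputs are incorporated."""
    def __init__(self, latent_dim=128, action_dim=6, top_dim=128):
        super().__init__()
        self.cell = nn.GRUCell(latent_dim + action_dim, top_dim)

    def step(self, bottom_latent, action, h):
        return self.cell(torch.cat([bottom_latent, action], dim=-1), h)

# Usage: roll both models over a short trajectory of observations and actions.
bottom, top = ActionFreeModel(), ActionConditionalModel()
B, T = 4, 10
h_bot = torch.zeros(B, 128)
h_top = torch.zeros(B, 128)
for t in range(T):
    image = torch.rand(B, 3, 64, 64)        # observation at step t
    action = torch.rand(B, 6)               # agent action at step t
    h_bot = bottom.step(image, h_bot)       # action-free dynamics (pre-trained)
    h_top = top.step(h_bot, action, h_top)  # action-conditional head (fine-tuned)
```

In this sketch, fine-tuning would train the top model (and optionally the bottom one) on environment interaction, while the bottom model's weights come from the video pre-training phase.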