We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for images, we aim to jointly encode a series of images gathered by an agent as it moves about in 3D environments. We learn these representations via a zone prediction task, where we intelligently mask out portions of an agent's trajectory and predict them from the unmasked portions, conditioned on the agent's camera poses. By learning such representations on a collection of videos, we demonstrate successful transfer to multiple downstream navigation-oriented tasks. Our experiments on the photorealistic 3D environments of Gibson and Matterport3D show that our method outperforms the state-of-the-art on challenging tasks with only a limited budget of experience.
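To make the zone-prediction objective concrete, the following is a minimal PyTorch sketch of the masking-and-prediction idea described above. It is not the paper's architecture: the class name, dimensions, transformer configuration, and the simple MSE regression objective are all illustrative assumptions; the point is only that image features of a masked trajectory zone are hidden while the corresponding camera poses remain visible, so the prediction is pose-conditioned.

```python
import torch
import torch.nn as nn

class ZonePredictionSketch(nn.Module):
    """Toy masked zone prediction (hypothetical, for illustration only):
    encode a pose-tagged trajectory with a transformer and regress the
    image features of a masked-out zone."""

    def __init__(self, feat_dim=128, pose_dim=4, d_model=256, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Linear(feat_dim + pose_dim, d_model)  # fuse image feature with camera pose
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, feat_dim)  # predict image features at masked steps

    def forward(self, feats, poses, zone_mask):
        # feats: (B, T, feat_dim) per-step image features along the trajectory
        # poses: (B, T, pose_dim) camera pose of each step
        # zone_mask: (B, T) bool, True on the masked-out zone
        # Zero out image features in the masked zone; poses stay visible,
        # so the prediction is conditioned on the agent's camera poses.
        visible = feats.masked_fill(zone_mask.unsqueeze(-1), 0.0)
        h = self.encoder(self.embed(torch.cat([visible, poses], dim=-1)))
        return self.head(h)

# Mask a contiguous zone of a 10-step trajectory; supervise only the masked steps.
model = ZonePredictionSketch()
feats = torch.randn(2, 10, 128)
poses = torch.randn(2, 10, 4)
zone_mask = torch.zeros(2, 10, dtype=torch.bool)
zone_mask[:, 3:6] = True
pred = model(feats, poses, zone_mask)
loss = ((pred - feats)[zone_mask] ** 2).mean()
loss.backward()
```

Restricting the loss to masked positions mirrors the abstract's setup: the model must infer the unseen zone from the unmasked portions of the trajectory rather than copy visible inputs.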