How should we learn visual representations for embodied agents that must see and move? The status quo is tabula rasa in vivo, i.e. learning visual representations from scratch while also learning to move, potentially augmented with auxiliary tasks (e.g. predicting the action taken between two successive observations). In this paper, we show that an alternative 2-stage strategy is far more effective: (1) offline pretraining of visual representations with self-supervised learning (SSL) using large-scale pre-rendered images of indoor environments (Omnidata), and (2) online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules. We call this method Offline Visual Representation Learning (OVRL). We conduct large-scale experiments - on 3 different 3D datasets (Gibson, HM3D, MP3D), 2 tasks (ImageNav, ObjectNav), and 2 policy learning algorithms (RL, IL) - and find that the OVRL representations lead to significant across-the-board improvements in the state of the art: on ImageNav from 29.2% to 54.2% (+25% absolute, 86% relative) and on ObjectNav from 18.1% to 23.2% (+5.1% absolute, 28% relative). Importantly, both results were achieved by the same visual encoder generalizing to datasets that were not seen during pretraining. While the benefits of pretraining sometimes diminish (or entirely disappear) with long finetuning schedules, we find that OVRL's performance gains continue to increase (not decrease) as the agent is trained for 2 billion frames of experience.
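The sketch below illustrates the two-stage recipe described above, assuming PyTorch/torchvision. The ResNet-50 backbone, the simple view-invariance loss, and the NavPolicy head are illustrative placeholders, not the paper's exact SSL objective or RL/IL training setup.

```python
# Minimal sketch of the 2-stage OVRL recipe (assumptions noted above).
import torch
import torch.nn as nn
import torchvision

# Stage 1: offline self-supervised pretraining of the visual encoder
# on pre-rendered indoor images (e.g. Omnidata).
encoder = torchvision.models.resnet50(weights=None)
encoder.fc = nn.Identity()  # keep only the 2048-d feature trunk

def ssl_loss(encoder, view_a, view_b):
    # Placeholder invariance objective: pull features of two augmented
    # views of the same image together (a stand-in for the paper's SSL loss).
    za = nn.functional.normalize(encoder(view_a), dim=-1)
    zb = nn.functional.normalize(encoder(view_b), dim=-1)
    return (2 - 2 * (za * zb).sum(dim=-1)).mean()

opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
for _ in range(10):                               # stand-in for the full offline schedule
    imgs = torch.rand(8, 3, 224, 224)             # stand-in for an Omnidata batch
    view_a, view_b = imgs + 0.01 * torch.randn_like(imgs), imgs
    loss = ssl_loss(encoder, view_a, view_b)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: plug the pretrained encoder into a visuomotor policy and
# finetune end-to-end on the downstream task (ImageNav / ObjectNav)
# with image augmentations under a long RL/IL schedule.
class NavPolicy(nn.Module):
    def __init__(self, encoder, num_actions=4):   # num_actions is illustrative
        super().__init__()
        self.encoder = encoder                    # initialised from stage 1
        self.rnn = nn.GRU(2048, 512, batch_first=True)
        self.actor = nn.Linear(512, num_actions)
        self.critic = nn.Linear(512, 1)

    def forward(self, obs, hidden=None):
        feat = self.encoder(obs).unsqueeze(1)     # (B, 1, 2048)
        out, hidden = self.rnn(feat, hidden)
        return self.actor(out[:, -1]), self.critic(out[:, -1]), hidden

policy = NavPolicy(encoder)
logits, value, _ = policy(torch.rand(2, 3, 224, 224))
```

The key design point the sketch captures is that the encoder weights are carried over from stage 1 and then updated jointly with the policy in stage 2, rather than being frozen or learned from scratch in vivo.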