Contrastive Language-Image Pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks, from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for Embodied AI tasks. We build incredibly simple baselines, named EmbCLIP, with no task-specific architectures, no inductive biases (such as the use of semantic maps), no auxiliary tasks during training, and no depth maps, yet we find that our improved baselines perform very well across a range of tasks and simulators. EmbCLIP tops the RoboTHOR ObjectNav leaderboard by a large margin of 20 points (Success Rate). It tops the iTHOR 1-Phase Rearrangement leaderboard, beating the next best submission, which employs Active Neural Mapping, and more than doubling the % Fixed Strict metric (0.08 to 0.17). It also beats the winners of the 2021 Habitat ObjectNav Challenge, which employ auxiliary tasks, depth maps, and human demonstrations, as well as those of the 2019 Habitat PointNav Challenge. We evaluate how well CLIP's visual representations capture semantic information about input observations (primitives that are useful for navigation-heavy embodied tasks) and find that CLIP's representations encode these primitives more effectively than ImageNet-pretrained backbones. Finally, we extend one of our baselines, producing an agent capable of zero-shot object navigation: it can navigate to objects that were not used as targets during training. Our code and models are available at https://github.com/allenai/embodied-clip
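To make the setup concrete, the sketch below shows how a frozen CLIP visual backbone can be dropped into a simple recurrent policy. This is an illustrative outline only, not the authors' exact architecture: the actual EmbCLIP baselines live in the repository above (built on AllenAct), and the GRU hidden size, actor-critic heads, and use of the openai `clip` package here are all assumptions made for the example.

```python
# Minimal sketch (not the authors' implementation): frozen CLIP RN50 features
# feeding a small recurrent actor-critic policy. Assumes PyTorch and the
# `clip` package from https://github.com/openai/CLIP; head sizes are illustrative.
import torch
import torch.nn as nn
import clip


class CLIPPolicy(nn.Module):
    def __init__(self, num_actions, hidden_size=512):
        super().__init__()
        # Load the CLIP RN50 visual backbone and freeze its weights.
        self.clip_model, self.preprocess = clip.load("RN50", device="cpu")
        for p in self.clip_model.parameters():
            p.requires_grad = False
        feat_dim = self.clip_model.visual.output_dim  # 1024 for RN50
        self.rnn = nn.GRU(feat_dim, hidden_size, batch_first=True)
        self.actor = nn.Linear(hidden_size, num_actions)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 3, 224, 224), already CLIP-preprocessed.
        b, t = frames.shape[:2]
        with torch.no_grad():
            feats = self.clip_model.encode_image(frames.flatten(0, 1)).float()
        feats = feats.view(b, t, -1)
        out, hidden = self.rnn(feats, hidden)
        return self.actor(out), self.critic(out), hidden
```

Because the CLIP weights stay frozen, only the recurrent state encoder and the linear heads are trained, which keeps the baseline minimal: essentially a standard visual navigation agent with its ImageNet backbone swapped out for CLIP.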