We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data scale and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Finally, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving performance competitive with or superior to the best known results on all of the benchmarks in CortexBench. These models required over 10,000 GPU-hours to train and can be found on our website for the benefit of the research community.
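The evaluation protocol behind these comparisons treats each PVR as a frozen visual encoder, with only a small task-specific head trained on top. The following is a minimal sketch of that setup, assuming a generic ImageNet-pretrained ViT from torchvision as a stand-in backbone; the class name, the MLP head, and the action dimension are illustrative placeholders, not the released VC-1 code (which uses an MAE-pretrained ViT-L).

```python
# Minimal sketch (not the released VC-1 code) of frozen-PVR evaluation:
# a pre-trained ViT encodes each RGB observation, and only a small
# task-specific policy head is trained on top of the frozen features.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights


class FrozenPVRPolicy(nn.Module):
    def __init__(self, action_dim: int):
        super().__init__()
        # Stand-in backbone; the paper's PVRs are MAE-pretrained ViTs.
        self.backbone = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.backbone.heads = nn.Identity()      # keep the 768-d CLS embedding
        for p in self.backbone.parameters():     # freeze the visual representation
            p.requires_grad = False
        self.policy_head = nn.Sequential(        # trainable, task-specific head
            nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(obs)           # (B, 768) frozen features
        return self.policy_head(feats)           # (B, action_dim) actions/logits


policy = FrozenPVRPolicy(action_dim=7)
actions = policy(torch.randn(2, 3, 224, 224))    # two 224x224 RGB observations
print(actions.shape)                             # torch.Size([2, 7])
```

Task- or domain-specific adaptation, as studied for VC-1, corresponds to unfreezing the backbone (end-to-end fine-tuning) or continuing MAE pre-training on in-domain frames before this stage.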