In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.
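To make the frozen-encoder setup concrete, here is a minimal sketch of the architecture described above: a pre-trained MAE visual encoder is frozen and its features are passed to a small learnable control module. This is illustrative PyTorch-style pseudocode under assumed dimensions (embedding size, action dimension) and an assumed MLP head; it is not the paper's exact implementation.

```python
# Minimal sketch (assumed PyTorch-style setup, not the authors' released code).
# A frozen pre-trained visual encoder feeds a small learnable control module.
import torch
import torch.nn as nn

class FrozenEncoderPolicy(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int = 768, action_dim: int = 7):
        super().__init__()
        self.encoder = encoder
        # Freeze the pre-trained MAE encoder: no gradient updates.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.encoder.eval()
        # Learnable control module (a small MLP head is one common choice; illustrative sizes).
        self.control = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                   # encoder stays frozen
            features = self.encoder(images)     # (B, embed_dim) image features
        return self.control(features)           # predicted actions
```

In this sketch only the control head's parameters are optimized during robot learning, which is what allows a single pre-trained encoder to be reused across tasks and embodiments.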