We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty that encourages sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.
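To make the pre-training objective concrete, the sketch below shows one way the three terms named above (time-contrastive learning, video-language alignment, and the L1 sparsity penalty) could be combined in PyTorch. This is a minimal illustration, not the authors' implementation: `R3MStylePretrainer`, its encoders, loss forms, and the batch layout are hypothetical placeholders, and the actual training details are in the paper and released code.

```python
# Hedged sketch of an R3M-style combined pre-training loss (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class R3MStylePretrainer(nn.Module):
    def __init__(self, visual_encoder, lang_encoder, emb_dim=512, l1_weight=1e-5):
        super().__init__()
        self.visual_encoder = visual_encoder   # e.g. a ResNet trunk mapping images -> emb_dim
        self.lang_encoder = lang_encoder       # frozen sentence encoder mapping captions -> emb_dim
        self.align_head = nn.Sequential(       # scores (video change, language) compatibility
            nn.Linear(3 * emb_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )
        self.l1_weight = l1_weight

    def forward(self, frames_t0, frames_t1, frames_t2, captions):
        # Encode three time-ordered frames sampled from each video clip.
        z0 = self.visual_encoder(frames_t0)    # (B, emb_dim)
        z1 = self.visual_encoder(frames_t1)
        z2 = self.visual_encoder(frames_t2)

        # 1) Time-contrastive term: frames nearby in time should be closer in
        #    embedding space than frames farther apart (simplified ranking form).
        pos = -(z0 - z1).pow(2).sum(-1)
        neg = -(z0 - z2).pow(2).sum(-1)
        tcn_loss = -F.logsigmoid(pos - neg).mean()

        # 2) Video-language alignment term: the change from an early to a later
        #    frame should score higher with the clip's own caption than with a
        #    mismatched caption from the batch.
        lang = self.lang_encoder(captions)     # (B, emb_dim)
        score_pos = self.align_head(torch.cat([z0, z2, lang], dim=-1))
        score_neg = self.align_head(torch.cat([z0, z2, lang.roll(1, 0)], dim=-1))
        align_loss = -F.logsigmoid(score_pos - score_neg).mean()

        # 3) L1 penalty encouraging sparse, compact embeddings.
        l1_loss = (z0.abs() + z1.abs() + z2.abs()).mean()

        return tcn_loss + align_loss + self.l1_weight * l1_loss
```

In downstream use, the trained `visual_encoder` would be frozen and its embedding fed to a small policy head learned from the handful of demonstrations, matching the "frozen perception module" usage described above.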