Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce $\textbf{V}$alue-$\textbf{I}$mplicit $\textbf{P}$re-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding. This lets the value function be defined implicitly via the embedding distance, which can then be used to construct the reward for any downstream task specified by a goal image. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual rewards for an extensive set of simulated and $\textbf{real-robot}$ tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, $\textbf{few-shot}$ offline RL on a suite of real-world robot tasks with as few as 20 trajectories.
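To make the idea of an embedding-distance value function concrete, here is a minimal sketch in Python. It rests on assumptions the abstract does not spell out: the value of an observation is taken to be the negative L2 distance between its embedding and the goal embedding, and the reward for a transition is taken to be the change in that value. The `encode` function is a hypothetical stand-in (a fixed random linear map), not the pre-trained VIP network, and the names `embedding_value` and `embedding_reward` are illustrative only.

```python
import numpy as np


def embedding_value(phi_obs, phi_goal):
    """Assumed value: negative L2 distance between observation and goal embeddings."""
    return -np.linalg.norm(phi_obs - phi_goal, axis=-1)


def embedding_reward(phi_obs, phi_next, phi_goal):
    """Assumed dense reward: change in value from one frame to the next."""
    return embedding_value(phi_next, phi_goal) - embedding_value(phi_obs, phi_goal)


# Hypothetical stand-in encoder: a fixed random linear map, NOT the VIP model.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))


def encode(image_vec):
    """Map flattened 'images' of size 32 to 8-dimensional embeddings."""
    return image_vec @ W.T


# Toy trajectory: six frames interpolating linearly from a start to the goal.
start = rng.normal(size=32)
goal = rng.normal(size=32)
frames = np.stack([start + t * (goal - start) for t in np.linspace(0.0, 1.0, 6)])

phi = encode(frames)        # shape (6, 8)
phi_goal = encode(goal)     # shape (8,)

# One reward per transition; positive whenever the embedding moves toward the goal.
rewards = embedding_reward(phi[:-1], phi[1:], phi_goal)
```

Under these assumptions the rewards telescope: their sum along a trajectory equals the value of the last frame minus the value of the first, so progress toward the goal image is rewarded at every step rather than only on success.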