从“在野中”人类视频中学习通用机器人奖励功能 (Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos)

We are motivated by the goal of generalist robots that can complete a wide range of tasks across many environments. Critical to this is the robot's ability to acquire some metric of task success or reward, which is necessary for reinforcement learning, planning, or knowing when to ask for help. For a general-purpose robot operating in the real world, this reward function must also be able to generalize broadly across environments, tasks, and objects, while depending only on on-board sensor observations (e.g. RGB images). While deep learning on large and diverse datasets has shown promise as a path towards such generalization in computer vision and natural language, collecting high quality datasets of robotic interaction at scale remains an open challenge. In contrast, "in-the-wild" videos of humans (e.g. YouTube) contain an extensive collection of people doing interesting tasks across a diverse range of settings. In this work, we propose a simple approach, Domain-agnostic Video Discriminator (DVD), that learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task, and can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos. We find that by leveraging diverse human datasets, this reward function (a) can generalize zero shot to unseen environments, (b) generalize zero shot to unseen tasks, and (c) can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.

翻译：我们的动力来自能够在许多环境中完成一系列广泛任务的通用机器人的目标。关键在于机器人有能力获得某些任务成功或奖励的衡量标准,这是加强学习、规划或了解何时请求帮助所必需的。对于在现实世界中运作的通用机器人来说,这种奖励功能还必须能够广泛分布在环境、任务和对象之间,同时仅取决于在机载传感器上的观测(如 RGB 图像 ) 。虽然对大型和多样化数据集的深入学习显示有希望成为在计算机视觉和自然语言中实现这种普遍化的途径,但收集高品质的机器人互动的视觉数据集仍然是公开的挑战。相比之下, “ 虚拟” 人类的视频( 如YouTube ) 包含大量在各种环境中执行有趣任务的人。在这项工作中,我们建议一种简单的方法, Domain- Agnistical Verimicriminator (DVDD), 通过训练一个小分析员来区分两个视频是否执行同样的任务, 高品质的直观的视觉数据集,我们可以通过一个普遍的机器人来利用普通数据功能, 。