Observing a human demonstrator manipulate objects provides a rich, scalable, and inexpensive source of data for learning robotic policies. However, transferring skills from human videos to a robotic manipulator poses several challenges, not least the difference in action and observation spaces. In this work, we use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies. Thanks to the diversity of this training data, the learned reward function generalizes sufficiently to image observations from a previously unseen robot embodiment and environment to provide a meaningful prior for directed exploration in reinforcement learning. We propose two methods for scoring states relative to a goal image: through direct temporal regression, and through distances in an embedding space obtained with time-contrastive learning. By conditioning the function on a goal image, we are able to reuse one model across a variety of tasks. Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies, yet it accelerates training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.
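To make the second scoring method concrete, the following is a minimal sketch of a goal-conditioned reward computed as the negative distance between the current observation and a goal image in a learned embedding space. All names here (`Encoder`, `embedding_reward`) and the toy architecture are illustrative assumptions, not the paper's implementation; the encoder stands in for one trained with a time-contrastive objective on human videos.

```python
# Sketch: goal-conditioned reward from distances in a learned embedding space.
# Hypothetical names and architecture; not the paper's actual model.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Toy stand-in for an image encoder trained with a
    time-contrastive objective on diverse human videos."""

    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


@torch.no_grad()
def embedding_reward(phi: nn.Module, obs: torch.Tensor,
                     goal: torch.Tensor) -> torch.Tensor:
    """Dense reward: negative L2 distance to the goal image in embedding space."""
    return -torch.linalg.vector_norm(phi(obs) - phi(goal), dim=-1)


# Usage: score a batch of camera observations against a single goal image.
phi = Encoder()
obs = torch.rand(4, 3, 64, 64)           # current image observations
goal = torch.rand(1, 3, 64, 64)          # goal image, broadcast over the batch
print(embedding_reward(phi, obs, goal))  # shape: (4,)
```

Because the reward depends only on the goal image, the same frozen encoder can be reused across tasks by swapping in a different goal, which is what lets one model condition a variety of downstream RL problems.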