Observing a human demonstrator manipulate objects provides a rich, scalable, and inexpensive source of data for learning robotic policies. However, transferring skills from human videos to a robotic manipulator poses several challenges, not least the difference in action and observation spaces. In this work, we use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies. Thanks to the diversity of this training data, the learned reward function generalizes sufficiently well to image observations from a previously unseen robot embodiment and environment to provide a meaningful prior for directed exploration in reinforcement learning. The learned rewards are based on distances to a goal in an embedding space learned with a time-contrastive objective. By conditioning the function on a goal image, we are able to reuse one model across a variety of tasks. Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies, yet it accelerates training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.
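The goal-conditioned reward described above can be sketched in a few lines: embed the current observation and the goal image with the learned encoder, and return the negative distance between the two embeddings. This is a minimal illustration only; the names `embed` and `hold_reward` are ours, and the random linear map stands in for the trained time-contrastive encoder, which the abstract does not specify in detail.

```python
import numpy as np

# Stand-in for the trained time-contrastive encoder (hypothetical; the real
# model is a learned network mapping images to an embedding space).
rng = np.random.default_rng(0)
IMG_SHAPE = (64, 64, 3)
EMB_DIM = 32
W = rng.standard_normal((EMB_DIM, int(np.prod(IMG_SHAPE)))) / np.sqrt(np.prod(IMG_SHAPE))

def embed(image: np.ndarray) -> np.ndarray:
    """Map an RGB observation to the embedding space (linear stand-in)."""
    return W @ image.reshape(-1)

def hold_reward(observation: np.ndarray, goal_image: np.ndarray) -> float:
    """Goal-conditioned reward: negative embedding-space distance to the goal.

    Conditioning on the goal image is what lets one model serve many tasks:
    switching tasks only requires switching the goal image.
    """
    return -float(np.linalg.norm(embed(observation) - embed(goal_image)))

obs = rng.random(IMG_SHAPE)
goal = rng.random(IMG_SHAPE)
r = hold_reward(obs, goal)   # negative scalar; 0 only when embeddings coincide
```

In practice such a dense, distance-based reward would be combined with (or used to shape) the sparse task-completion signal during reinforcement learning.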