Learning from visual data opens the potential to acquire a large range of manipulation behaviors by leveraging human demonstrations without specifying each of them mathematically, but rather through natural task specification. In this paper, we present Learning by Watching (LbW), an algorithmic framework for policy learning through imitation from a single video specifying the task. The key insights of our method are two-fold. First, since human arms may not share the same morphology as robot arms, our framework learns unsupervised human-to-robot translation to overcome the morphology mismatch. Second, to capture the details in salient regions that are crucial for learning state representations, our model performs unsupervised keypoint detection on the translated robot videos. The detected keypoints form a structured representation that contains semantically meaningful information and can be used directly for reward computation and policy learning. We evaluate the effectiveness of our LbW framework on five robot manipulation tasks: reaching, pushing, sliding, coffee making, and drawer closing. Extensive experimental evaluations demonstrate that our method performs favorably against state-of-the-art approaches.
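To make the keypoint-based reward concrete, below is a minimal sketch of how a reward could be computed from detected keypoints, assuming keypoints are arrays of 2D image coordinates and the reward is the negative mean Euclidean distance between the agent's keypoints and those detected on the translated demonstration frame. The function and variable names here are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def keypoint_reward(agent_keypoints: np.ndarray,
                    demo_keypoints: np.ndarray) -> float:
    """Negative mean Euclidean distance between corresponding
    keypoints; both inputs are (K, 2) arrays of (x, y) coordinates."""
    assert agent_keypoints.shape == demo_keypoints.shape
    dists = np.linalg.norm(agent_keypoints - demo_keypoints, axis=-1)
    return -float(dists.mean())

# Hypothetical usage inside a policy-learning loop:
# k_agent = keypoint_detector(robot_frame)             # unsupervised detector
# k_demo  = keypoint_detector(translated_demo_frame)   # human-to-robot translated video
# r_t = keypoint_reward(k_agent, k_demo)
```

Because the reward is dense in the keypoint space rather than in raw pixels, it can drive standard reinforcement-learning updates without task-specific reward engineering.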