We present an approach to physical imitation from human videos for robot manipulation tasks. The key idea of our method lies in explicitly exploiting the kinematics and motion information embedded in the video to learn structured representations that endow the robot with the ability to imagine how to perform manipulation tasks in its own context. To achieve this, we design a perception module that learns to translate human videos to the robot domain, followed by unsupervised keypoint detection. The resulting keypoint-based representations provide semantically meaningful information that can be directly used for reward computation and policy learning. We evaluate the effectiveness of our approach on five robot manipulation tasks: reaching, pushing, sliding, coffee making, and drawer closing. Detailed experimental evaluations demonstrate that our method performs favorably against previous approaches.
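To give a rough sense of how keypoint-based representations can drive reward computation, the sketch below scores the agreement between keypoints detected on the robot's current observation and keypoints from the corresponding translated demonstration frame. The function name `keypoint_reward`, the Gaussian shaping, and the bandwidth `sigma` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def keypoint_reward(robot_kpts: np.ndarray,
                    demo_kpts: np.ndarray,
                    sigma: float = 0.5) -> float:
    """Hypothetical dense reward from keypoint agreement.

    robot_kpts, demo_kpts: arrays of shape (K, 2) holding K detected
    keypoint coordinates for the robot observation and the translated
    human-demonstration frame at the same time step. The Gaussian
    shaping and bandwidth are assumptions for illustration only.
    """
    # Per-keypoint squared Euclidean distance, averaged over keypoints.
    sq_dist = np.sum((robot_kpts - demo_kpts) ** 2, axis=-1)
    # Map distance to (0, 1]: reward approaches 1 as keypoints align.
    return float(np.exp(-sq_dist.mean() / (2.0 * sigma ** 2)))

# Example: four keypoints in normalized image coordinates.
demo = np.array([[0.2, 0.3], [0.5, 0.5], [0.7, 0.4], [0.9, 0.8]])
robot = demo + 0.05 * np.random.randn(*demo.shape)  # perturbed detections
print(keypoint_reward(robot, demo))  # close to 1.0 when keypoints align
```

A scalar reward of this form could then be consumed by any standard reinforcement-learning policy optimizer at each time step of the imitated trajectory.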