Visual imitation learning provides efficient and intuitive solutions for robotic systems to acquire novel manipulation skills. However, simultaneously learning geometric task constraints and control policies from visual inputs alone remains a challenging problem. In this paper, we propose an approach for keypoint-based visual imitation (K-VIL) that automatically extracts sparse, object-centric, and embodiment-independent task representations from a small number of human demonstration videos. The task representation is composed of keypoint-based geometric constraints on principal manifolds, their associated local frames, and the movement primitives required for task execution. Our approach is capable of extracting such task representations from a single demonstration video and of incrementally updating them when new demonstrations become available. To reproduce manipulation skills in novel scenes using the learned set of prioritized geometric constraints, we introduce a novel keypoint-based admittance controller. We evaluate our approach in several real-world applications, showcasing its ability to deal with cluttered scenes, viewpoint mismatches, new instances of categorical objects, and large variations in object pose and shape, as well as its efficiency and robustness in both one-shot and few-shot imitation learning settings. Videos and source code are available at https://sites.google.com/view/k-vil.