Visual imitation learning provides efficient and intuitive solutions for robotic systems to acquire novel manipulation skills. However, simultaneously learning geometric task constraints and control policies from visual inputs alone remains a challenging problem. In this paper, we propose a keypoint-based visual imitation learning (K-VIL) approach that automatically extracts sparse, object-centric, and embodiment-independent task representations from a small number of human demonstration videos. The task representation is composed of keypoint-based geometric constraints on principal manifolds, their associated local frames, and the movement primitives required for task execution. Our approach is capable of extracting such task representations from a single demonstration video, and of incrementally updating them when new demonstrations become available. To reproduce manipulation skills using the learned set of prioritized geometric constraints in novel scenes, we introduce a keypoint-based admittance controller. We evaluate our approach in several real-world applications, showcasing its ability to deal with cluttered scenes, new instances of categorical objects, and large variations in object pose and shape, as well as its efficiency and robustness in both one-shot and few-shot imitation learning settings. Videos and source code are available at https://sites.google.com/view/k-vil.
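For illustration only, the following Python sketch shows one possible way to organize the task representation described above (keypoint-based geometric constraints on principal manifolds, their associated local frames, a movement primitive, and incremental updates from new demonstrations). All class, field, and function names here are assumptions made for exposition and are not taken from the K-VIL source code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class GeometricConstraint:
    """One keypoint constraint, expressed in the local frame of an associated object.

    All names and fields are illustrative assumptions, not K-VIL's actual API.
    """
    keypoint_id: int
    manifold: str              # principal manifold type, e.g. "point", "curve", "surface"
    local_frame: np.ndarray    # 4x4 homogeneous transform of the associated local frame
    priority: int              # lower value = higher priority for the controller


@dataclass
class TaskRepresentation:
    """Sparse, object-centric, embodiment-independent task representation."""
    constraints: List[GeometricConstraint] = field(default_factory=list)
    movement_primitive: Optional[np.ndarray] = None  # e.g. parameters of a keypoint trajectory

    def update(self, new_constraints: List[GeometricConstraint]) -> None:
        """Incrementally incorporate constraints extracted from a new demonstration."""
        self.constraints.extend(new_constraints)
        self.constraints.sort(key=lambda c: c.priority)
```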