How do we imbue robots with the ability to efficiently manipulate unseen objects and transfer relevant skills based on demonstrations? End-to-end learning methods often fail to generalize to novel objects or unseen configurations. Instead, we focus on the task-specific pose relationship between relevant parts of interacting objects. We conjecture that this relationship is a generalizable notion of a manipulation task that can transfer to new objects in the same category; examples include the pose of a pan relative to an oven or the pose of a mug relative to a mug rack. We call this task-specific pose relationship "cross-pose" and provide a mathematical definition of this concept. We propose a vision-based system that learns to estimate the cross-pose between two objects for a given manipulation task using learned cross-object correspondences. The estimated cross-pose is then used to guide a downstream motion planner to manipulate the objects into the desired pose relationship (for example, placing a pan into the oven or a mug onto the mug rack). We demonstrate our method's capability to generalize to unseen objects, in some cases after training on only 10 demonstrations in the real world. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments across a number of tasks. Supplementary information and videos can be found at https://sites.google.com/view/tax-pose/home.
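The abstract states that the cross-pose is estimated from cross-object correspondences but does not say how a pose is recovered from them. As a minimal sketch, assuming the correspondences are already available as matched 3D points, one standard option is the weighted least-squares rigid fit (the Kabsch/Procrustes solution via SVD). The function name `fit_rigid_transform` and its arguments are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def fit_rigid_transform(src, dst, weights=None):
    """Return (R, t) minimizing sum_i w_i * ||R @ src[i] + t - dst[i]||^2.

    src, dst: (N, 3) arrays of matched points (assumed correspondences).
    weights:  optional (N,) non-negative weights; uniform if omitted.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if weights is None:
        weights = np.ones(len(src))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Weighted centroids of both point sets.
    src_mean = w @ src
    dst_mean = w @ dst

    # Weighted cross-covariance of the centered points (3x3).
    src_c = src - src_mean
    dst_c = dst - dst_mean
    H = (src_c * w[:, None]).T @ dst_c

    # Rotation from the SVD, with a sign correction to avoid reflections.
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Translation that aligns the centroids after rotation.
    t = dst_mean - R @ src_mean
    return R, t
```

In this sketch, `src` would hold points on the manipulated object (say, the mug) and `dst` their predicted target locations relative to the other object (the rack). The returned `(R, t)` is then the rigid motion a planner would be asked to realize.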