Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state; in such tasks, the ability to reason about spatial relationships among object entities from raw sensor inputs is crucial. Prior work that relies on explicit state estimation or end-to-end learning struggles to generalize to novel objects or new tasks. In this work, we propose SORNet (Spatial Object-Centric Representation Network), which extracts object-centric representations from RGB images conditioned on canonical views of the objects of interest. We show that the object embeddings learned by SORNet generalize zero-shot to unseen object entities on three spatial reasoning tasks: spatial relationship classification, skill precondition classification, and relative direction regression, significantly outperforming baselines. Further, we present real-world robotic experiments demonstrating the use of the learned object embeddings in task planning for sequential manipulation.
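To make the conditioning idea concrete, below is a minimal sketch of how canonical object views might condition the extraction of object embeddings. Everything here is an illustrative assumption rather than the paper's actual implementation: the class name `SORNetSketch`, the shared patch projection, the layer sizes, and the omission of positional encodings are choices made for brevity. The point it demonstrates is that canonical object views enter a transformer encoder as extra tokens alongside scene patch tokens, and the encoder outputs at those object-token positions are read off as per-object embeddings.

```python
import torch
import torch.nn as nn


class SORNetSketch(nn.Module):
    """Illustrative sketch (not the paper's implementation): a transformer
    encoder consumes scene patch tokens together with one token per
    canonical object view; outputs at the object-token positions serve as
    object-centric embeddings."""

    def __init__(self, patch_size=32, embed_dim=256, num_layers=4, num_heads=8):
        super().__init__()
        self.patch_size = patch_size
        # Linear projection of flattened RGB patches to token embeddings.
        # Canonical object views reuse the same projection here for
        # simplicity; positional encodings are omitted for brevity.
        self.patch_proj = nn.Linear(3 * patch_size * patch_size, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def patchify(self, img):
        # img: (B, 3, H, W) -> (B, N, 3*P*P) non-overlapping patches.
        B, C, H, W = img.shape
        P = self.patch_size
        patches = img.unfold(2, P, P).unfold(3, P, P)  # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        return patches

    def forward(self, scene, canonical_views):
        # scene: (B, 3, H, W); canonical_views: (B, K, 3, P, P), one per object.
        B, K = canonical_views.shape[:2]
        scene_tokens = self.patch_proj(self.patchify(scene))            # (B, N, D)
        obj_tokens = self.patch_proj(canonical_views.reshape(B, K, -1))  # (B, K, D)
        tokens = torch.cat([obj_tokens, scene_tokens], dim=1)
        out = self.encoder(tokens)
        # Object embeddings are read off the first K output positions,
        # one per queried object, independent of how many objects exist.
        return out[:, :K]  # (B, K, D)


if __name__ == "__main__":
    model = SORNetSketch()
    scene = torch.randn(2, 3, 224, 224)          # batch of RGB scene images
    views = torch.randn(2, 5, 3, 32, 32)         # 5 canonical object views each
    embeddings = model(scene, views)
    print(embeddings.shape)                      # torch.Size([2, 5, 256])
```

Because the objects of interest are specified only through their canonical-view tokens, a sketch like this can be queried with views of objects never seen during training, which is the property the zero-shot generalization claim above rests on; downstream heads (e.g., relation classifiers) would consume the returned per-object embeddings.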