Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state. In such tasks, the ability to reason about spatial relations among object entities from raw sensor inputs is crucial for determining when a task has been completed and which actions can be executed. In this work, we propose SORNet (Spatial Object-Centric Representation Network), a framework for learning object-centric representations from RGB images conditioned on a set of object queries, which are represented as image patches called canonical object views. With only a single canonical view per object and no annotation, SORNet generalizes zero-shot to object entities whose shape and texture are both unseen during training. We evaluate SORNet on spatial reasoning tasks, including spatial relation classification and relative direction regression, in complex tabletop manipulation scenarios and show that SORNet significantly outperforms baselines, including state-of-the-art representation learning techniques. We also demonstrate the application of the representation learned by SORNet to visual servoing and task planning for sequential manipulation on a real robot.
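To make the query-conditioned interface concrete, the following is a minimal sketch of the data flow only, not the SORNet architecture, which the abstract does not specify. It assumes the scene is split into non-overlapping context patches, each canonical object view contributes one query token, and pairwise relation classification consumes the concatenated embeddings of every ordered object pair. The function names (`patchify`, `build_tokens`, `pairwise_inputs`), the single-channel toy image, and the identity stand-in for the learned encoder are all hypothetical.

```python
from itertools import permutations


def patchify(image, patch):
    """Split an H x W grid (list of rows) into non-overlapping
    patch x patch tiles, each flattened into a list of values."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tiles.append([image[top + r][left + c]
                          for r in range(patch) for c in range(patch)])
    return tiles


def build_tokens(image, canonical_views, patch):
    """Return context tokens from the scene followed by one query token
    per canonical object view, plus the number of context tokens."""
    context = patchify(image, patch)
    queries = [[v for row in view for v in row] for view in canonical_views]
    return context + queries, len(context)


def pairwise_inputs(object_embeddings):
    """Concatenate embeddings for every ordered pair (i, j) with i != j,
    the input a hypothetical relation classifier would consume."""
    n = len(object_embeddings)
    return {(i, j): object_embeddings[i] + object_embeddings[j]
            for i, j in permutations(range(n), 2)}


# Toy 8x8 single-channel scene (a real input would be RGB).
image = [[(r * 8 + c) / 64 for c in range(8)] for r in range(8)]
# Three hypothetical 4x4 canonical object views, one per queried object.
views = [[[0.1 * (k + 1)] * 4 for _ in range(4)] for k in range(3)]

tokens, n_context = build_tokens(image, views, patch=4)

# Assumption: a learned encoder would map all tokens to embeddings and the
# outputs at the query positions would serve as object embeddings. Here the
# query tokens are passed through unchanged as a stand-in.
object_embeddings = tokens[n_context:]
pairs = pairwise_inputs(object_embeddings)
```

Under these assumptions, adding a new object at test time only requires appending one more canonical view to `views`; the number of ordered pairs then grows as N(N-1) for N queried objects.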