Grounded understanding of natural language in physical scenes can greatly benefit robots that follow human instructions. In object manipulation scenarios, existing end-to-end models are proficient at understanding semantic concepts but typically cannot handle complex instructions involving spatial relations among multiple objects, which require both reasoning about object-level spatial relations and learning precise pixel-level manipulation affordances. We take an initial step toward this challenge with a decoupled two-stage solution. In the first stage, we propose an object-centric semantic-spatial reasoner that selects the objects relevant to the language-instructed task. The segmentation masks of the selected objects are then fused as additional input to the affordance learning stage. Simply incorporating this inductive bias of relevant objects into a vision-language affordance learning agent effectively boosts its performance on a custom testbed designed for object manipulation with spatially related language instructions.
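The fusion step between the two stages can be illustrated with a minimal sketch. This is a hypothetical implementation, not the paper's actual code: it assumes the stage-1 reasoner outputs per-object binary masks, which are concatenated onto the RGB observation as extra channels before being fed to the stage-2 affordance network. Function and variable names are illustrative.

```python
import numpy as np

def fuse_masks_with_rgb(rgb, masks):
    """Concatenate binary masks of the selected objects onto the RGB channels.

    rgb:   (H, W, 3) float array in [0, 1]
    masks: list of (H, W) binary arrays, one per object picked by the reasoner
    returns: (H, W, 3 + k) array serving as input to the affordance network
    """
    if masks:
        merged = np.stack(masks, axis=-1).astype(np.float32)
    else:
        # No relevant object selected: pass an all-zero mask channel.
        merged = np.zeros(rgb.shape[:2] + (1,), dtype=np.float32)
    return np.concatenate([rgb.astype(np.float32), merged], axis=-1)

# Example: the reasoner selected two relevant objects for this instruction.
rgb = np.random.rand(64, 64, 3)
masks = [np.zeros((64, 64)), np.ones((64, 64))]
fused = fuse_masks_with_rgb(rgb, masks)
print(fused.shape)  # (64, 64, 5)
```

Keeping the fusion as simple channel concatenation is what makes the inductive bias easy to add to an existing vision-language affordance agent: only the input layer's channel count changes.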
Title: Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning