Given a natural language instruction and an input scene, our goal is to train a model to output a manipulation program that can be executed by the robot. Prior approaches to this task suffer from one of the following limitations: (i) they rely on hand-coded symbols for concepts, limiting generalization beyond those seen during training [1]; (ii) they infer action sequences from instructions but require dense sub-goal supervision [2]; or (iii) they lack the semantics required for the deeper object-centric reasoning inherent in interpreting complex instructions [3]. In contrast, our approach handles linguistic as well as perceptual variations, is end-to-end trainable, and requires no intermediate supervision. The proposed model uses symbolic reasoning constructs that operate on a latent neural object-centric representation, allowing deeper reasoning over the input scene. Central to our approach is a modular structure comprising a hierarchical instruction parser and an action simulator that learn disentangled action representations. Our experiments in a simulated environment with a 7-DOF manipulator, covering instructions with a varying number of steps and scenes with differing numbers of objects, demonstrate that our model is robust to such variations and significantly outperforms baselines, particularly in the generalization settings. The code, dataset, and experiment videos are available at https://nsrmp.github.io.
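To make the modular structure concrete, the following is a minimal, self-contained sketch (not the authors' code) of the pipeline the abstract describes: a parser maps an instruction to a symbolic program, whose steps are rolled out against a latent object-centric scene representation by a learned action simulator. All class names, the toy one-rule parser, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

OBJ_DIM = 32  # assumed size of each object's latent embedding

class ObjectEncoder(nn.Module):
    """Maps raw per-object features (e.g., crops or attributes) to latents."""
    def __init__(self, in_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, OBJ_DIM))

    def forward(self, obj_feats):           # (num_objects, in_dim)
        return self.net(obj_feats)          # (num_objects, OBJ_DIM)

class ActionSimulator(nn.Module):
    """Predicts the latent scene state after one symbolic action, so a
    multi-step program can be executed without sub-goal supervision."""
    def __init__(self, num_actions=4):
        super().__init__()
        self.action_embed = nn.Embedding(num_actions, OBJ_DIM)
        self.update = nn.Sequential(nn.Linear(3 * OBJ_DIM, 64), nn.ReLU(),
                                    nn.Linear(64, OBJ_DIM))

    def forward(self, scene, action_id, src, dst):
        # scene: (num_objects, OBJ_DIM); src/dst index the argument objects
        a = self.action_embed(torch.tensor(action_id))
        new_src = self.update(torch.cat([scene[src], scene[dst], a]))
        scene = scene.clone()
        scene[src] = new_src                # only the manipulated object changes
        return scene

def parse(instruction):
    """Toy stand-in for the hierarchical instruction parser: returns a
    symbolic program as (action_id, src_index, dst_index) steps."""
    # e.g., "move the red cube onto the blue dice" -> [(0, 0, 1)]
    return [(0, 0, 1)]

# Roll out the parsed program step by step on the latent scene.
encoder, simulator = ObjectEncoder(), ActionSimulator()
scene = encoder(torch.randn(3, 8))          # three objects in the input scene
for action_id, src, dst in parse("move the red cube onto the blue dice"):
    scene = simulator(scene, action_id, src, dst)
print(scene.shape)                          # torch.Size([3, 32])
```

In a trained system the final latent scene would be decoded and compared against the goal, so gradients flow end-to-end through the simulator and parser without intermediate supervision; here the rollout only illustrates the interface between the modules.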