In this paper we present a neuro-symbolic (hybrid) compositional reasoning model that couples language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot agent in natural language, providing a referring expression (REC), a question (VQA), or a grasp action instruction. The model handles all three cases in a task-agnostic fashion by drawing on a shared library of primitive skills. Each primitive handles an independent sub-task, such as reasoning about visual attributes, spatial relation comprehension, logic and enumeration, or arm control. A language parser maps the input query to an executable program composed of such primitives, depending on the context. While some primitives are purely symbolic operations (e.g. counting), others are trainable neural functions (e.g. grounding words to images), thereby marrying the interpretability and systematic generalization benefits of discrete symbolic approaches with the scalability and representational power of deep networks. We generate a synthetic dataset of tabletop scenes to train our approach and perform several evaluation experiments for VQA on both the synthetic dataset and a real RGB-D dataset. Results show that the proposed method achieves very high accuracy while being transferable to novel content with few-shot visual fine-tuning. Finally, we integrate our method with a robot framework and demonstrate how it can serve as an interpretable solution for an interactive object picking task, both in simulation and with a real robot.
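To make the idea of an executable program composed of primitives concrete, the following is a minimal sketch, not the implementation described in this paper. All names in it (`PRIMITIVES`, `execute`, the scene format, the program encoding as `(operation, argument)` pairs) are hypothetical. In particular, the attribute filter here is a plain symbolic lookup over annotated objects, standing in for the trainable neural grounding functions, and the `grasp` primitive only returns a description of the action instead of controlling an arm.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical scene format: each object carries symbolic attributes and an
# x coordinate on the table (smaller x means further to the left).
Scene = List[Dict[str, Any]]
Program = List[Tuple[str, Optional[Any]]]


def op_scene(scene: Scene, state: Any, arg: Any) -> Scene:
    """Start a program with the set of all objects in the scene."""
    return list(scene)


def op_filter(scene: Scene, state: Scene, arg: Tuple[str, Any]) -> Scene:
    """Keep objects whose attribute matches (symbolic stand-in for neural grounding)."""
    key, value = arg
    return [obj for obj in state if obj.get(key) == value]


def op_unique(scene: Scene, state: Scene, arg: Any) -> Dict[str, Any]:
    """Resolve an object set to a single object, failing loudly if ambiguous."""
    if len(state) != 1:
        raise ValueError(f"expected exactly one object, got {len(state)}")
    return state[0]


def op_relate(scene: Scene, state: Dict[str, Any], arg: str) -> Scene:
    """Return all objects standing in a spatial relation to the anchor object."""
    if arg == "left":
        return [obj for obj in scene if obj["x"] < state["x"]]
    if arg == "right":
        return [obj for obj in scene if obj["x"] > state["x"]]
    raise ValueError(f"unknown relation: {arg}")


def op_count(scene: Scene, state: Scene, arg: Any) -> int:
    """Purely symbolic enumeration primitive."""
    return len(state)


def op_exist(scene: Scene, state: Scene, arg: Any) -> bool:
    """Purely symbolic logic primitive."""
    return len(state) > 0


def op_grasp(scene: Scene, state: Dict[str, Any], arg: Any) -> Dict[str, str]:
    """Placeholder for arm control: only describes the action to take."""
    return {"action": "grasp", "target": state["name"]}


# Shared library of primitives, reused across REC, VQA and grasping queries.
PRIMITIVES: Dict[str, Callable[[Scene, Any, Any], Any]] = {
    "scene": op_scene,
    "filter": op_filter,
    "unique": op_unique,
    "relate": op_relate,
    "count": op_count,
    "exist": op_exist,
    "grasp": op_grasp,
}


def execute(program: Program, scene: Scene) -> Any:
    """Run primitives in sequence, feeding each output into the next step."""
    state: Any = None
    for name, arg in program:
        state = PRIMITIVES[name](scene, state, arg)
    return state


SCENE: Scene = [
    {"name": "red_mug", "category": "mug", "color": "red", "x": 0.1},
    {"name": "blue_bowl", "category": "bowl", "color": "blue", "x": 0.5},
    {"name": "blue_mug", "category": "mug", "color": "blue", "x": 0.8},
]

# VQA: "How many blue objects are there?"
COUNT_BLUE: Program = [("scene", None), ("filter", ("color", "blue")), ("count", None)]

# REC: "the mug to the left of the bowl"
REFER_MUG: Program = [
    ("scene", None),
    ("filter", ("category", "bowl")),
    ("unique", None),
    ("relate", "left"),
    ("filter", ("category", "mug")),
    ("unique", None),
]

# Grasp instruction: the same referring program followed by an action primitive.
GRASP_MUG: Program = REFER_MUG + [("grasp", None)]

if __name__ == "__main__":
    print(execute(COUNT_BLUE, SCENE))
    print(execute(REFER_MUG, SCENE)["name"])
    print(execute(GRASP_MUG, SCENE))
```

In this sketch the grasp query reuses the referring-expression program unchanged and only appends one primitive, which illustrates how a shared library lets a single executor serve all three query types. A language parser would be responsible for producing the program lists that are written by hand here.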