Service robots should be able to interact naturally with non-expert human users, not only to help them in various tasks but also to receive guidance that resolves ambiguities in their instructions. We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description. Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains, and therefore rely heavily on large datasets. Additionally, their transfer performance on RGB-D datasets suffers due to the high visual discrepancy between the benchmark and target domains. Modular approaches marry learning with domain modeling and exploit the compositional nature of language to decouple visual representation from language parsing, but they either rely on external parsers or are trained end-to-end due to the lack of strong supervision. In this work, we seek to tackle these limitations by introducing a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations. We exploit rich scene graph annotations generated in a synthetic domain and train each module independently. Our approach is evaluated both in simulation and on two real RGB-D scene datasets. Experimental results show that the decoupled nature of our framework allows for easy integration with domain adaptation approaches for Sim-To-Real visual recognition, offering a data-efficient, robust, and interpretable solution to visual grounding in robotic applications.
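To make the compositional idea concrete, the following is a minimal sketch, not the framework itself. It assumes the scene has already been reduced to a list of annotated objects (a stand-in for a scene graph) and that the description has already been parsed into a nested query. The names `SceneObject`, `ground_entity`, `ground_attribute`, `ground_relation`, `ground`, and the query format are all hypothetical, and simple symbolic lookups stand in for the independently trained modules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical scene-graph node: one detected object."""
    obj_id: int
    category: str
    attributes: frozenset
    center: tuple  # (x, y) in image coordinates; an assumed convention


def ground_entity(objects, category):
    # Entity module stand-in: keep objects of the named category.
    return [o for o in objects if o.category == category]


def ground_attribute(objects, attribute):
    # Attribute module stand-in: keep objects carrying the attribute.
    return [o for o in objects if attribute in o.attributes]


# Spatial relation stand-ins: purely geometric rules on object centers.
RELATIONS = {
    "left_of": lambda a, b: a.center[0] < b.center[0],
    "right_of": lambda a, b: a.center[0] > b.center[0],
}


def ground_relation(candidates, relation, anchors):
    # Keep candidates that satisfy the relation with at least one anchor.
    pred = RELATIONS[relation]
    return [
        c for c in candidates
        if any(pred(c, a) for a in anchors if a.obj_id != c.obj_id)
    ]


def ground(objects, query):
    # Compose the modules following the structure of the parsed query.
    candidates = ground_entity(objects, query["entity"])
    for attr in query.get("attributes", []):
        candidates = ground_attribute(candidates, attr)
    if "relation" in query:
        relation, anchor_query = query["relation"]
        anchors = ground(objects, anchor_query)  # recurse on the anchor phrase
        candidates = ground_relation(candidates, relation, anchors)
    return candidates


# Toy scene: two mugs on either side of a laptop.
scene = [
    SceneObject(0, "mug", frozenset({"red"}), (10, 50)),
    SceneObject(1, "mug", frozenset({"blue"}), (80, 50)),
    SceneObject(2, "laptop", frozenset({"black"}), (50, 50)),
]

# Assumed parse of "the red mug left of the laptop".
query = {
    "entity": "mug",
    "attributes": ["red"],
    "relation": ("left_of", {"entity": "laptop"}),
}

matches = ground(scene, query)
ambiguous = ground(scene, {"entity": "mug"})  # more than one match
```

Because each step only filters a candidate list, any single stand-in could be swapped for a learned model without touching the others, which is the sense in which the modules are decoupled. A query that returns more than one candidate, such as `ambiguous` above, marks the kind of case where the robot would ask the user for further guidance.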