Object grounding tasks aim to locate a target object in an image through verbal communication. Understanding human commands is an essential step toward effective human-robot communication, but it is challenging because human commands can be ambiguous or erroneous. This paper aims to disambiguate a human's referring expressions by allowing the agent to ask relevant questions based on semantic data obtained from scene graphs. We test whether our agent can use the relations between objects in a scene graph to ask semantically relevant questions that disambiguate the original user command. We present Incremental Grounding using Scene Graphs (IGSG), a disambiguation model that combines semantic data from an image scene graph with linguistic structures from a language scene graph to ground objects referred to by human commands. Compared to the baseline, IGSG shows promising results in complex real-world scenes containing multiple identical target objects, and it can effectively resolve ambiguous or erroneous referring expressions by asking disambiguating questions back to the user.
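To make the idea concrete, the sketch below illustrates (under our own simplifying assumptions, not as the paper's actual implementation) how relations stored in a scene graph could be turned into a disambiguating question when a referring expression matches several candidate objects. All class names, the toy scene, and the question template are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Relation:
    subject: str    # e.g. "cup"
    predicate: str  # e.g. "on"
    obj: str        # e.g. "table"

@dataclass
class SceneObject:
    object_id: int
    label: str
    relations: List[Relation]

def candidates_for(label: str, scene: List[SceneObject]) -> List[SceneObject]:
    """Return every scene object whose label matches the command's head noun."""
    return [o for o in scene if o.label == label]

def disambiguating_question(label: str, scene: List[SceneObject]) -> str:
    """If the referring expression is ambiguous (matches more than one object),
    use each candidate's scene-graph relations to phrase a clarifying question."""
    matches = candidates_for(label, scene)
    if len(matches) <= 1:
        return ""  # unambiguous: no question needed
    options = []
    for obj in matches:
        if obj.relations:
            r = obj.relations[0]
            options.append(f"the {label} {r.predicate} the {r.obj}")
        else:
            options.append(f"a {label}")
    return "Did you mean " + " or ".join(options) + "?"

# Hypothetical scene with two identical cups distinguished only by their relations.
scene = [
    SceneObject(1, "cup", [Relation("cup", "on", "table")]),
    SceneObject(2, "cup", [Relation("cup", "next to", "sink")]),
]
print(disambiguating_question("cup", scene))
# -> Did you mean the cup on the table or the cup next to the sink?
```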