The ability to handle objects in cluttered environments has long been anticipated by the robotics community. However, most existing work focuses solely on manipulation rather than uncovering the semantic information hidden among cluttered objects. In this work, we introduce the scene graph for embodied exploration in cluttered scenarios to address this problem. To validate our method in cluttered scenarios, we adopt Manipulation Question Answering (MQA) tasks as our test benchmark, which require an embodied robot to actively explore the scene and to semantically understand both vision and language. As a general solution framework for the task, we propose an imitation learning method that generates manipulations for exploration. Meanwhile, within our framework, a VQA model based on a dynamic scene graph answers questions by comprehending the series of RGB frames captured by the manipulator's wrist camera as each manipulation step is conducted. Experiments on an MQA dataset with different interaction requirements demonstrate that our proposed framework is effective for the MQA task, a representative task in cluttered scenarios.
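To make the described pipeline concrete, the following is a minimal, illustrative sketch of the exploration-and-answering loop: a policy proposes manipulations, each resulting wrist-camera frame is parsed into detections that update a dynamic scene graph, and a QA module queries the graph after every step. All class and function names here (`DynamicSceneGraph`, `explore_and_answer`, the toy existence-question matcher) are hypothetical placeholders standing in for the paper's learned components, not its actual API.

```python
"""Hedged sketch of an MQA-style loop over a dynamic scene graph.

Assumptions: perception per step is pre-recorded as (detections, relations)
pairs, and question answering is reduced to a toy existence check; the paper's
imitation-learned policy and VQA model are far richer than these stand-ins.
"""

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class DynamicSceneGraph:
    # Nodes are detected object labels; edges map (subject, object) pairs
    # to spatial relation labels such as "behind" or "on".
    nodes: Set[str] = field(default_factory=set)
    edges: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def update(self, detections: List[str],
               relations: Dict[Tuple[str, str], str]) -> None:
        """Fold one frame's (hypothetical) perception output into the graph."""
        self.nodes.update(detections)
        self.edges.update(relations)

    def answer(self, question: str) -> Optional[str]:
        """Toy existence QA: answer "yes" once a mentioned object is in the graph."""
        for obj in self.nodes:
            if obj in question:
                return "yes"
        return None  # graph does not yet contain enough evidence


def explore_and_answer(question: str,
                       episode: List[Tuple[List[str], Dict[Tuple[str, str], str]]]
                       ) -> str:
    """Run the loop over a recorded episode, one manipulation step per entry."""
    graph = DynamicSceneGraph()
    for detections, relations in episode:
        graph.update(detections, relations)  # perception after each push/pick
        ans = graph.answer(question)
        if ans is not None:                  # stop exploring once the graph suffices
            return ans
    return "no"


if __name__ == "__main__":
    # Two manipulation steps: the second uncovers a previously hidden mug.
    episode = [
        (["box"], {}),
        (["box", "mug"], {("mug", "box"): "behind"}),
    ]
    print(explore_and_answer("Is there a mug in the bin?", episode))  # -> yes
```

The design point the sketch tries to capture is that the scene graph is *dynamic*: each manipulation can reveal occluded objects, so answering is interleaved with exploration rather than performed once on a static observation.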