In recent years, developing AI for robotics has attracted considerable attention. The interaction between vision and language is particularly difficult for robots. We argue that giving robots an understanding of both visual semantics and language semantics will improve their inference ability. In this paper, we propose a novel method, VSGM (Visual Semantic Graph Memory), which uses a semantic graph to obtain better visual image features and improve the robot's visual understanding. Given the robot's prior knowledge and the objects detected in the image, VSGM predicts the correlations between object attributes and between objects, converts them into a graph-based representation, and maps the objects in the image onto a top-down egocentric map. Finally, the object features important to the current task are extracted by a Graph Neural Network. The proposed method is evaluated on the ALFRED (Action Learning From Realistic Environments and Directives) dataset, in which a robot must perform daily indoor household tasks by following language instructions. After adding VSGM to the model, the task success rate improves by 6~10%.
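To make the pipeline described above concrete, the following is a minimal sketch, under our own assumptions, of the graph-based feature extraction step: detected objects become graph nodes, pairwise relations derived from prior knowledge form the adjacency matrix, and a small graph neural network pools task-relevant object features. All names here (SimpleGCNLayer, SemanticGraphEncoder, node_dim, etc.) are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One round of graph message passing: aggregate neighbor features, then project."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_objects, in_dim) per-object features from a detector
        # adj: (num_objects, num_objects) relation weights, e.g. from prior knowledge
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)  # degree normalization
        agg = adj @ x / deg                                 # mean over neighbors
        return torch.relu(self.proj(agg))

class SemanticGraphEncoder(nn.Module):
    """Stack two GCN layers and pool into a single graph-level feature vector."""
    def __init__(self, node_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.gcn1 = SimpleGCNLayer(node_dim, hidden_dim)
        self.gcn2 = SimpleGCNLayer(hidden_dim, hidden_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = self.gcn1(node_feats, adj)
        h = self.gcn2(h, adj)
        return h.mean(dim=0)  # pooled feature for a downstream policy

# Toy usage: 5 detected objects with 128-d features and a prior-knowledge adjacency.
objects = torch.randn(5, 128)
relations = (torch.rand(5, 5) > 0.5).float()
graph_feature = SemanticGraphEncoder()(objects, relations)
print(graph_feature.shape)  # torch.Size([128])
```

The pooled graph feature would then be fused with language-instruction features and the top-down egocentric map before action prediction; how that fusion is done is specific to the paper's architecture.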