For robots to understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that comprehend referential language to identify common objects in real-world 3D scenes. In this paper, we introduce a spatial-language model for the 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of point clouds with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model successfully identifies the target object from the set of potential candidates. Our model, LanguageRefer, uses a transformer-based architecture that combines spatial embeddings from the bounding boxes with fine-tuned language embeddings from DistilBERT to predict the target object. We show that it performs competitively on the visio-linguistic datasets proposed by ReferIt3D. Further, we analyze its spatial reasoning performance decoupled from perception noise, its accuracy on view-dependent utterances, and viewpoint annotations for potential robotics applications.
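To make the architecture in the abstract concrete, the following is a minimal sketch of the idea of fusing spatial and language embeddings in one transformer-style sequence. All names, dimensions, and the random placeholder embeddings are illustrative assumptions, not the authors' implementation: real language embeddings would come from a fine-tuned DistilBERT, and the attention layer would have learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # embedding width (illustrative choice, not from the paper)

def spatial_embed(boxes, W):
    """Project 3D bounding boxes (center xyz + size whd) into the model width."""
    return boxes @ W  # shape: (num_objects, D)

def self_attention(x):
    """Single-head scaled dot-product self-attention (unlearned, for illustration)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

# Toy inputs: 5 candidate objects, 7 utterance tokens. The random vectors
# stand in for DistilBERT token embeddings, which we do not load here.
boxes = rng.normal(size=(5, 6))           # (cx, cy, cz, w, h, d) per candidate
lang_tokens = rng.normal(size=(7, D))     # placeholder language embeddings
W_spatial = rng.normal(size=(6, D)) * 0.1 # hypothetical learned projection

obj_tokens = spatial_embed(boxes, W_spatial)        # spatial embeddings
seq = np.concatenate([lang_tokens, obj_tokens], 0)  # joint token sequence
out = self_attention(seq)                           # one transformer-style layer

# Classification head: score each object token; argmax selects the
# predicted referent among the candidates.
w_cls = rng.normal(size=(D,)) * 0.1
logits = out[len(lang_tokens):] @ w_cls
pred = int(np.argmax(logits))
print(pred)
```

With random weights the prediction is of course meaningless; the point is only the data flow: boxes and utterance tokens share one sequence, attention mixes them, and a per-object head scores the candidates.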