To realize robots that can understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that can understand referential language to identify common objects in real-world 3D scenes. In this paper, we develop a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model identifies the target object from the set of candidates. Our spatial-language model uses a transformer-based architecture that combines spatial embeddings from the bounding boxes with fine-tuned language embeddings from DistilBERT, and reasons among the objects in the 3D scene to find the target object. We show that our model performs competitively on the visio-linguistic datasets proposed by ReferIt3D. We also provide additional analyses of performance on spatial reasoning tasks decoupled from perception noise, of the effect of view-dependent utterances on accuracy, and of viewpoint annotations for potential robotics applications.
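To make the architecture description concrete, the following is a minimal sketch of one way such a spatial-language model could be wired together in PyTorch. Only the use of DistilBERT for language embeddings and a transformer reasoning jointly over object candidates comes from the abstract; the module names, dimensions, box parameterization (center plus extents), and scoring head are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch of a spatial-language grounding model.
# Assumptions (not from the paper): 6-DoF box features, a 2-layer MLP
# spatial embedding, and a per-object linear scoring head.
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast


class SpatialLanguageGrounder(nn.Module):
    def __init__(self, d_model=768, n_heads=8, n_layers=4):
        super().__init__()
        # Fine-tunable language encoder (DistilBERT, as named in the abstract).
        self.lang = DistilBertModel.from_pretrained("distilbert-base-uncased")
        # Spatial embedding: project each candidate's 3D bounding box
        # (center x, y, z + extents w, h, d) into the shared model dimension.
        self.box_embed = nn.Sequential(
            nn.Linear(6, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Transformer that reasons jointly over word tokens and object tokens.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.reasoner = nn.TransformerEncoder(layer, n_layers)
        # Per-object score: is this candidate the referred target?
        self.score = nn.Linear(d_model, 1)

    def forward(self, input_ids, attention_mask, boxes):
        # boxes: (batch, num_candidates, 6)
        words = self.lang(input_ids, attention_mask=attention_mask).last_hidden_state
        objects = self.box_embed(boxes)
        fused = self.reasoner(torch.cat([words, objects], dim=1))
        # Score only the object tokens (appended after the word tokens).
        obj_out = fused[:, words.size(1):, :]
        return self.score(obj_out).squeeze(-1)  # (batch, num_candidates)


if __name__ == "__main__":
    tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    enc = tok(["the chair closest to the window"], return_tensors="pt")
    boxes = torch.randn(1, 5, 6)  # five candidate bounding boxes
    model = SpatialLanguageGrounder()
    logits = model(enc["input_ids"], enc["attention_mask"], boxes)
    print(logits.shape)  # torch.Size([1, 5]); argmax picks the target object
```

At inference, the candidate with the highest logit is returned as the referred object; training would follow with a standard cross-entropy loss over the candidate scores.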