In the Vision-and-Language Navigation (VLN) task, an embodied agent follows natural-language instructions to navigate to a specified goal. The task is important in many practical scenarios and has attracted extensive attention from both the computer vision and robotics communities. However, most existing works rely only on RGB images and neglect the 3D semantic information of the scene. To this end, we develop a novel self-supervised training framework that encodes a voxel-level 3D semantic reconstruction into a 3D semantic representation. Specifically, a region query task is designed as the pretext task: it predicts the presence or absence of objects of a particular class within a specific 3D region. We then construct an LSTM-based navigation model and train it on vision-language pairs with the proposed 3D semantic representations and BERT language features. Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset, respectively, which are superior to most RGB-based methods that utilize vision-language transformers.
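To make the region query pretext task concrete, below is a minimal sketch (not the authors' released code) that frames it as binary classification: given a pooled feature of a 3D voxel region and a queried object class, predict whether an object of that class is present in the region, with labels derived from the semantic reconstruction itself. The module name `RegionQueryHead`, the feature dimensions, and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RegionQueryHead(nn.Module):
    """Hypothetical region-query head: presence/absence of a class in a 3D region."""
    def __init__(self, voxel_feat_dim=256, num_classes=40, class_emb_dim=64):
        super().__init__()
        # embed the queried object class
        self.class_emb = nn.Embedding(num_classes, class_emb_dim)
        # predict a presence logit from the region feature plus the class embedding
        self.classifier = nn.Sequential(
            nn.Linear(voxel_feat_dim + class_emb_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, region_feat, class_id):
        # region_feat: (B, voxel_feat_dim) pooled feature of a 3D region
        # class_id:    (B,) integer id of the queried object class
        q = self.class_emb(class_id)
        logits = self.classifier(torch.cat([region_feat, q], dim=-1))
        return logits.squeeze(-1)  # one presence logit per (region, class) query


# Self-supervised training signal: the label is 1 if any voxel of class_id
# lies inside the queried region of the 3D semantic reconstruction, else 0.
head = RegionQueryHead()
region_feat = torch.randn(8, 256)              # stand-in for encoded region features
class_id = torch.randint(0, 40, (8,))          # queried classes
labels = torch.randint(0, 2, (8,)).float()     # stand-in presence labels
loss = nn.BCEWithLogitsLoss()(head(region_feat, class_id), labels)
```

In this reading, no manual annotation is required: the voxel-level reconstruction supplies the presence labels, so the encoder producing `region_feat` can be trained end-to-end in a self-supervised manner before being reused by the navigation model.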