Vision-and-language navigation (VLN), a frontier problem on the path toward general-purpose robots, has become a prominent topic in the computer vision and natural language processing communities. The VLN task requires an agent to navigate to a goal location in unfamiliar environments by following natural language instructions. Recently, transformer-based models have achieved significant improvements on the VLN task, since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language. However, current transformer-based models suffer from two problems. 1) The models process each view independently without taking the integrity of objects into account. 2) During the self-attention operation in the visual modality, views that are spatially distant can be inter-weaved with each other without explicit restriction; this kind of mixing may introduce extra noise instead of useful information. To address these issues, we propose 1) a slot-attention based module that aggregates information from segmentations of the same object, and 2) a local attention mask mechanism that limits the visual attention span. The proposed modules can be easily plugged into any VLN architecture, and we use Recurrent VLN-BERT as our base model. Experiments on the R2R dataset show that our model achieves state-of-the-art results.
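To make the second idea concrete, below is a minimal sketch of how a local attention mask might restrict visual self-attention to spatially nearby panoramic views. It assumes the standard R2R discretization of 36 views (12 headings x 3 elevations); the function names, the angular threshold `max_angle`, and the single-head attention are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a local attention mask over panoramic view features.
# Assumes 36 discretized views (12 headings x 3 elevations), as in R2R.
import math
import torch

def build_local_attention_mask(headings, elevations, max_angle=math.pi / 3):
    """Block attention between views whose angular distance exceeds
    `max_angle`, so self-attention only mixes spatially nearby views.

    headings, elevations: (N,) tensors of view angles in radians.
    Returns a (N, N) boolean mask; True marks pairs to be blocked.
    """
    # Pairwise heading difference, wrapped into [-pi, pi].
    dh = headings[:, None] - headings[None, :]
    dh = torch.atan2(torch.sin(dh), torch.cos(dh))
    de = elevations[:, None] - elevations[None, :]
    dist = torch.sqrt(dh ** 2 + de ** 2)
    return dist > max_angle

def masked_self_attention(x, mask):
    """Plain single-head self-attention with the local mask applied.
    x: (N, D) view embeddings; mask: (N, N) boolean block mask."""
    d = x.size(-1)
    scores = x @ x.t() / math.sqrt(d)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x

# Usage on the 36 panoramic views of R2R:
headings = torch.arange(36) % 12 * (2 * math.pi / 12)     # 12 headings, 30 deg apart
elevations = (torch.arange(36) // 12 - 1) * (math.pi / 6)  # elevations -30/0/+30 deg
views = torch.randn(36, 768)                               # dummy view embeddings
mask = build_local_attention_mask(headings, elevations)
out = masked_self_attention(views, mask)
```

Because each view is always within the threshold of itself, every row of the masked score matrix keeps at least one finite entry, so the softmax remains well defined; in a full model the same boolean mask would simply be added to the attention logits of the visual branch.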