Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to a goal position, relying on ongoing interactions with the environment as it moves. Recent Transformer-based VLN methods have made great progress by directly connecting visual observations and the language instruction via a multimodal cross-attention mechanism. However, these methods usually represent temporal context as a fixed-length vector, either with an LSTM decoder or with manually designed hidden states that form a recurrent Transformer. Since a single fixed-length vector is often insufficient to capture long-term temporal context, in this paper we introduce the Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation, which models temporal context explicitly. Specifically, MTVM enables the agent to keep track of its navigation trajectory by directly storing previous activations in a memory bank. To further boost performance, we propose a memory-aware consistency loss that helps learn a better joint representation of temporal context with randomly masked instructions. We evaluate MTVM on the popular R2R and CVDN datasets: our model improves Success Rate on the R2R unseen validation and test sets by 2% each, and improves Goal Progress by 1.6 m on the CVDN test set.
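To make the variable-length memory idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: at each navigation step the fused cross-modal activation is appended to a memory bank, and the whole bank is attended over at the next step instead of being compressed into one fixed-length vector. All names and hyperparameters here (MemoryBankAgent, d_model, n_heads, mean-pooling of the stored activation) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MemoryBankAgent(nn.Module):
    """Sketch of a cross-modal step with a growing, variable-length memory bank."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        # cross-attention over [instruction tokens; memory bank; current observation]
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.memory = []  # list of per-step activations; grows with trajectory length

    def reset(self):
        # clear the memory bank at the start of a new episode
        self.memory = []

    def step(self, obs_tokens, instr_tokens):
        # obs_tokens:   (B, N_obs, d) visual features of the current panorama
        # instr_tokens: (B, N_txt, d) encoded instruction
        mem = torch.cat(self.memory, dim=1) if self.memory else obs_tokens[:, :0]
        context = torch.cat([instr_tokens, mem, obs_tokens], dim=1)
        # the current observation queries the instruction, the memory, and itself
        fused, _ = self.cross_attn(obs_tokens, context, context)
        # store a pooled activation of this step in the memory bank
        self.memory.append(fused.mean(dim=1, keepdim=True).detach())
        return fused  # used downstream to score candidate actions
```

Under this reading, the memory-aware consistency loss could be realized as a consistency term (e.g. L2 or KL) between the fused representation computed from the full instruction and the one computed from a randomly masked copy of it, encouraging the joint temporal representation to stay stable; the exact formulation is the paper's, not shown here.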