Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions. Most existing works represent a navigation candidate by the feature of the single view in which the candidate lies. However, an instruction may mention landmarks outside that single view as references, which can cause textual-visual matching failures for existing methods. In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) that adaptively incorporates visual contexts from neighbor views for better textual-visual matching. Specifically, NvEM utilizes a subject module and a reference module to collect contexts from neighbor views: the subject module fuses neighbor views at a global level, and the reference module fuses neighbor objects at a local level. Subjects and references are determined adaptively via attention mechanisms. Our model also includes an action module to exploit the strong orientation guidance (e.g., ``turn left'') in instructions. Each module predicts navigation actions separately, and their weighted sum is used to predict the final action. Extensive experimental results on the R2R and R4R benchmarks demonstrate the effectiveness of the proposed method against several state-of-the-art navigators, and NvEM even outperforms some pre-training-based ones. Our code is available at https://github.com/MarSaKi/NvEM.
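To make the weighted-sum fusion of per-module predictions concrete, the sketch below combines the action logits of the three modules (subject, reference, action) using weights derived from an instruction-aware state vector. This is a minimal PyTorch illustration under assumed tensor shapes; the class name `ModuleFusion`, the `weight_head` layer, and the softmax weighting are illustrative assumptions, not the released NvEM implementation.

```python
import torch
import torch.nn as nn


class ModuleFusion(nn.Module):
    """Illustrative sketch: fuse per-module action logits with learned weights.

    The module roles (subject / reference / action) follow the abstract; the
    concrete shapes and the weighting head are assumptions for illustration.
    """

    def __init__(self, hidden_dim: int, num_modules: int = 3):
        super().__init__()
        # Hypothetical head that maps an instruction-aware state to one
        # scalar weight per module.
        self.weight_head = nn.Linear(hidden_dim, num_modules)

    def forward(self, state: torch.Tensor, module_logits: list) -> torch.Tensor:
        # state: (batch, hidden_dim)
        # module_logits: list of num_modules tensors, each (batch, num_candidates)
        weights = torch.softmax(self.weight_head(state), dim=-1)   # (batch, num_modules)
        stacked = torch.stack(module_logits, dim=1)                # (batch, num_modules, num_candidates)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)       # (batch, num_candidates)
        return fused  # final scores over navigation candidates
```

Usage would amount to passing the three modules' candidate scores and the current agent state, then taking an argmax (or sampling) over the fused scores to select the next navigation action.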