Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches have made enormous progress in navigating new environments with techniques such as beam search, pre-exploration, and dynamic or hierarchical history encoding. To balance generalization and efficiency, we resort to memorizing visited scenarios beyond the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent's memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture for its next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to static observations. We provide a simple yet effective implementation of ESceme by enhancing the accessible views at each location and progressively completing the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. ESceme also wins first place on the CVDN leaderboard. Code is available: \url{https://github.com/qizhust/esceme}.
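The memory mechanism described above can be illustrated with a minimal sketch: a per-scene store that accumulates the views observed at each location and merges them with the current observation on later visits. All class and variable names here are hypothetical stand-ins for illustration; the paper's actual implementation (see the linked repository) operates on view features rather than identifiers and may differ in detail.

```python
class EpisodicSceneMemory:
    """Accumulates views observed at each location of a scene, so
    later visits see an enhanced picture beyond the current route.
    (Illustrative sketch only; names are not from the paper.)"""

    def __init__(self):
        # scene_id -> {node_id -> set of view ids (stand-ins for features)}
        self._memory = {}

    def update(self, scene_id, node_id, view_ids):
        # Progressively complete the memory while navigating.
        scene = self._memory.setdefault(scene_id, {})
        scene.setdefault(node_id, set()).update(view_ids)

    def enhanced_views(self, scene_id, node_id, current_view_ids):
        # Union of what the agent sees now and what it remembers here.
        remembered = self._memory.get(scene_id, {}).get(node_id, set())
        return set(current_view_ids) | remembered


mem = EpisodicSceneMemory()
mem.update("scene_A", "node_3", {"v0", "v1"})            # first visit
views = mem.enhanced_views("scene_A", "node_3", {"v2"})  # later visit
mem.update("scene_A", "node_3", views)                   # keep memory current
```

After the second visit, `views` contains the remembered views `v0` and `v1` in addition to the currently observed `v2`, giving the agent a bigger picture for its next prediction.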