Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

Vision-and-Language Navigation (VLN) is a task that requires an agent to navigate through an environment by following natural language instructions. At each step, the agent takes its next action by selecting from a set of navigable locations. In this paper, we go one step further and explore whether the agent can benefit from generating the potential future view during navigation. Intuitively, humans form an expectation of what the future environment will look like based on the natural language instructions and the surrounding views, and this expectation aids correct navigation. Hence, to equip the agent with the ability to generate the semantics of future navigation views, we first propose three proxy tasks for the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG). These three objectives teach the model to predict missing views in a panorama (MPM), predict missing steps in the full trajectory (MTM), and generate the next view based on the full instruction and navigation history (APIG), respectively. We then fine-tune the agent on the VLN task with an auxiliary loss that minimizes the difference between the view semantics generated by the agent and the ground-truth view semantics of the next step. Empirically, our VLN-SIG achieves a new state of the art on both the Room-to-Room dataset and the CVDN dataset. We further show qualitatively that our agent learns to fill in missing patches in future views, which makes the agent's predicted actions more interpretable. Lastly, we demonstrate that learning to predict future view semantics also enables the agent to perform better on longer paths.
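
The fine-tuning objective described above combines the usual action-selection loss with an auxiliary term on the generated view semantics. The abstract does not say how the "difference" is measured, so the following is only a minimal sketch under stated assumptions: view semantics are represented as a probability distribution over a discrete visual vocabulary, the difference is a cross-entropy, and the two terms are mixed with a weight. All names (`fine_tuning_loss`, `aux_weight`, and so on) and the default weight of 0.5 are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def action_loss(action_logits, gold_action):
    """Negative log-likelihood of the gold next action among navigable locations."""
    return -log_softmax(action_logits)[gold_action]


def semantic_loss(semantic_logits, target_semantics):
    """Cross-entropy between target view semantics and the predicted distribution.

    ASSUMPTION: semantics are a distribution over a discrete visual vocabulary.
    """
    log_probs = log_softmax(semantic_logits)
    return -sum(t * lp for t, lp in zip(target_semantics, log_probs) if t > 0)


def fine_tuning_loss(action_logits, gold_action,
                     semantic_logits, target_semantics, aux_weight=0.5):
    """Navigation loss plus a weighted auxiliary future-view semantics loss.

    ASSUMPTION: aux_weight is an illustrative hyperparameter, not a reported value.
    """
    nav = action_loss(action_logits, gold_action)
    aux = semantic_loss(semantic_logits, target_semantics)
    return nav + aux_weight * aux


if __name__ == "__main__":
    # Two navigable locations, four entries in the hypothetical visual vocabulary.
    loss = fine_tuning_loss(
        action_logits=[0.0, 0.0], gold_action=1,
        semantic_logits=[0.0, 0.0, 0.0, 0.0],
        target_semantics=[1.0, 0.0, 0.0, 0.0],
    )
    print(f"combined loss: {loss:.4f}")
```

With uninformative (all-zero) logits, the navigation term equals ln 2 and the auxiliary term equals ln 4, so the combined value is a convenient sanity check. In a real agent both sets of logits would come from the same model, so the auxiliary term pushes the shared representation toward anticipating the next view.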
