Vision-Language Navigation (VLN) requires an agent to follow natural language instructions to reach a specific target. The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well. Previous studies propose data augmentation methods that mitigate the data bias explicitly or implicitly and provide improvements in generalization. However, they tend to memorize the augmented trajectories and ignore the distribution shifts in unseen environments at test time. In this paper, we propose an unseen Discrepancy Anticipating Vision and language navigation framework (DAVIS) that learns to generalize to unseen environments by encouraging test-time visual consistency. Specifically, we devise: 1) a semi-supervised framework, DAVIS, that leverages visual consistency signals across observations with similar semantics; 2) a two-stage learning procedure that encourages adaptation to the test-time distribution. The framework enhances the basic mixture of imitation and reinforcement learning with Momentum Contrast to encourage stable decision-making on similar observations, under a joint training stage and a test-time adaptation stage. Extensive experiments show that DAVIS achieves model-agnostic improvements over previous state-of-the-art VLN baselines on the R2R and RxR benchmarks. Our source code and data are included in the supplemental materials.
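To make the Momentum Contrast component concrete, below is a minimal sketch of a MoCo-style consistency objective over two augmented views of the same visual observation. This is an illustration under our own assumptions, not the authors' implementation: the linear encoders, feature dimensions, queue of negatives, and all names (`MomentumEncoder`, `encoder_q`, `encoder_k`) are hypothetical.

```python
# Minimal sketch (assumed PyTorch implementation, not the paper's code):
# a MoCo-style loss pulling together two views of the same observation
# while pushing apart a queue of negative features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentumEncoder(nn.Module):
    def __init__(self, feat_dim=2048, dim=128, momentum=0.999):
        super().__init__()
        self.momentum = momentum
        self.encoder_q = nn.Linear(feat_dim, dim)  # query encoder (trained by backprop)
        self.encoder_k = nn.Linear(feat_dim, dim)  # key encoder (EMA copy, frozen)
        self.encoder_k.load_state_dict(self.encoder_q.state_dict())
        for p in self.encoder_k.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def _update_key_encoder(self):
        # Exponential moving average: k <- m * k + (1 - m) * q
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data.mul_(self.momentum).add_(pq.data, alpha=1.0 - self.momentum)

    def forward(self, view_q, view_k, queue, temperature=0.07):
        q = F.normalize(self.encoder_q(view_q), dim=1)          # [B, dim]
        with torch.no_grad():
            self._update_key_encoder()
            k = F.normalize(self.encoder_k(view_k), dim=1)      # [B, dim]
        l_pos = (q * k).sum(dim=1, keepdim=True)                # positive logits [B, 1]
        l_neg = q @ queue.t()                                   # negative logits [B, K]
        logits = torch.cat([l_pos, l_neg], dim=1) / temperature
        labels = torch.zeros(q.size(0), dtype=torch.long)       # positive is index 0
        return F.cross_entropy(logits, labels)

# Usage: two augmented views of the same panorama features plus a
# hypothetical queue of K negative keys.
model = MomentumEncoder()
feats_q = torch.randn(8, 2048)                        # view 1 of 8 observations
feats_k = torch.randn(8, 2048)                        # view 2 of the same observations
queue = F.normalize(torch.randn(4096, 128), dim=1)    # K = 4096 negatives
loss = model(feats_q, feats_k, queue)
```

Because the key encoder is updated only by the slow EMA rule, its outputs drift smoothly, which is what allows the same consistency signal to be reused at test time without labels.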