The vision-language navigation (VLN) task requires an agent to reach a target by following a natural language instruction. Previous works learn to navigate step by step along an instruction. However, these works may fail to distinguish the similarities and discrepancies across instruction-trajectory pairs, and they ignore the temporal continuity of sub-instructions. These problems hinder agents from learning distinctive vision-and-language representations, which harms the robustness and generalizability of the navigation policy. In this paper, we propose a Contrastive Instruction-Trajectory Learning (CITL) framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation. Specifically, we propose: (1) a coarse-grained contrastive learning objective that enhances vision-and-language representations by contrasting the semantics of full trajectory observations and of full instructions, respectively; (2) a fine-grained contrastive learning objective that improves instruction understanding by leveraging the temporal information of sub-instructions; (3) a pairwise sample-reweighting mechanism for contrastive learning that mines hard samples and thereby mitigates the influence of data sampling bias. CITL can be easily integrated with VLN backbones to form a new learning paradigm and achieve better generalizability in unseen environments. Extensive experiments show that models trained with CITL surpass the previous state-of-the-art methods on R2R, R4R, and RxR.
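To make the idea of contrastive learning with pairwise sample reweighting more concrete, the following is a minimal sketch of an InfoNCE-style loss in which harder negatives (those more similar to the anchor) receive larger weights. It is an illustration only: the abstract does not specify the actual CITL objectives or reweighting rule, so the function name `reweighted_info_nce`, the cosine similarity, the `temperature` and `beta` parameters, and the exponential weighting scheme are all assumptions made here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def reweighted_info_nce(anchor, positive, negatives, temperature=0.1, beta=1.0):
    """InfoNCE-style loss with hypothetical hardness weights on the negatives.

    anchor, positive: embedding vectors (e.g. an instruction and its trajectory).
    negatives: list of embedding vectors from mismatched pairs.
    beta: assumed hardness strength; beta = 0 gives the unweighted InfoNCE loss.
    """
    if not negatives:
        raise ValueError("at least one negative sample is required")

    pos_term = math.exp(cosine(anchor, positive) / temperature)
    sims = [cosine(anchor, n) for n in negatives]

    # Assumed weighting: exp(beta * similarity), rescaled so the weights
    # average to 1. Negatives closer to the anchor count more.
    raw = [math.exp(beta * s) for s in sims]
    mean_raw = sum(raw) / len(raw)
    weights = [r / mean_raw for r in raw]

    neg_term = sum(w * math.exp(s / temperature) for w, s in zip(weights, sims))
    return -math.log(pos_term / (pos_term + neg_term))


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    positive = [0.9, 0.1]
    negatives = [[0.8, 0.6], [-1.0, 0.0]]  # one hard negative, one easy negative
    print("unweighted:", reweighted_info_nce(anchor, positive, negatives, beta=0.0))
    print("reweighted:", reweighted_info_nce(anchor, positive, negatives, beta=2.0))
```

In this sketch, raising `beta` shifts weight toward the negative that lies closest to the anchor, so the loss never falls below its unweighted value and the gradient concentrates on the pairs that are hardest to tell apart. Whether CITL weights samples in this way cannot be determined from the abstract alone.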