Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve generalization. However, simulator environments still offer limited diversity, and web-collected data often require extensive labor to remove noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates unseen observation-instruction pairs by rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in a simulator-free and labor-saving manner to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten, object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). We then propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the differences between the original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments in both discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) demonstrate the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
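To make the random observation cropping scheme mentioned above more concrete, below is a minimal, self-contained sketch of how such a crop-and-resize step could be applied to a synthesized observation before training. The crop scale range, the resize-back behavior, and the function name `random_observation_crop` are assumptions for illustration; the abstract only states that random cropping is used to suppress augmentation noise, not its exact parameters.

```python
import random
from PIL import Image


def random_observation_crop(image: Image.Image,
                            min_scale: float = 0.7,
                            max_scale: float = 1.0) -> Image.Image:
    """Randomly crop an observation image and resize it back to its original size.

    NOTE: crop scale range and bicubic resizing are illustrative assumptions,
    not the exact settings used in the RAM paper.
    """
    w, h = image.size
    scale = random.uniform(min_scale, max_scale)
    crop_w, crop_h = int(w * scale), int(h * scale)
    left = random.randint(0, w - crop_w)
    top = random.randint(0, h - crop_h)
    crop = image.crop((left, top, left + crop_w, top + crop_h))
    return crop.resize((w, h), Image.BICUBIC)


if __name__ == "__main__":
    # Dummy image standing in for a T2IM-synthesized observation.
    obs = Image.new("RGB", (640, 480), color=(120, 160, 200))
    augmented = random_observation_crop(obs)
    print(augmented.size)  # (640, 480)
```

In a mixing-then-focusing setup, a step like this would typically be applied only to the synthesized (rewritten) observations, leaving the original human-annotated data untouched.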