ETP-R1: Evolving Topological Planning with Reinforcement Fine-tuning for Vision-Language Navigation in Continuous Environments

Vision-Language Navigation in Continuous Environments (VLN-CE) requires an embodied agent to navigate to a target in a continuous environment by following natural language instructions. Current graph-based methods offer an efficient, structured approach: they abstract the environment into a topological map and reduce the action space to waypoint selection. However, they lag behind methods based on Large Vision-Language Models (LVLMs) in leveraging large-scale data and advanced training paradigms. In this paper, we aim to bridge this gap by introducing ETP-R1, a framework that applies data scaling and Reinforcement Fine-Tuning (RFT) to a graph-based VLN-CE model. To build a strong foundation, we first construct a high-quality, large-scale pretraining dataset using the Gemini API. This dataset consists of diverse, low-hallucination instructions for topological trajectories, providing rich supervision for our graph-based policy to map language to topological paths. We further strengthen this foundation by unifying data from the R2R and RxR tasks for joint pretraining. Building on this, we introduce a three-stage training paradigm that culminates in the first application of closed-loop, online RFT to a graph-based VLN-CE model, powered by the Group Relative Policy Optimization (GRPO) algorithm. Extensive experiments demonstrate that our approach establishes new state-of-the-art performance across all major metrics on both the R2R-CE and RxR-CE benchmarks. Our code is available at https://github.com/Cepillar/ETP-R1.
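For readers unfamiliar with GRPO, the following is a minimal sketch of its central step: several rollouts are sampled for the same input, and each rollout's reward is normalized against the mean and standard deviation of its own group, so no learned value function is needed. This is not taken from the ETP-R1 code. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all illustrative assumptions.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per rollout, normalized within the group.

    `rewards` holds the scalar rewards of all rollouts sampled for the
    same instruction. `eps` (an assumed constant) avoids division by
    zero when every rollout receives the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one instruction
# (for example, 1.0 for reaching the target and 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above their group's mean receive positive advantages and those below receive negative ones. If all rollouts in a group score the same, every advantage is zero and that group contributes no gradient signal.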
