Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet remains relatively underexplored. This study focuses on generating precise, in-situ, step-by-step navigation instructions that are practically usable by VI users. Specifically, we propose LaF-GRPO (LLM-as-Follower GRPO), in which an LLM simulates VI user responses to navigation instructions and thereby provides feedback rewards to guide the post-training of a Vision-Language Model (VLM). This enhances instruction accuracy and usability while reducing the need for costly real-world data collection. To address the scarcity of dedicated benchmarks in this field, we introduce NIG4VI, a 27k-sample open-source dataset to facilitate training and evaluation. It comprises diverse navigation scenarios with accurate spatial coordinates, supporting detailed, open-ended, in-situ instruction generation. Experiments on NIG4VI demonstrate the effectiveness of LaF-GRPO through quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU by 14\%; SFT+(LaF-GRPO) attains a METEOR of 0.542 vs. GPT-4o's 0.323), and qualitative analysis further confirms that our method yields more intuitive and safer instructions.
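To make the training signal concrete, the following is a minimal Python sketch of how a follower LLM's simulated outcome could be turned into a scalar reward and GRPO-style group-relative advantages. All names here (the follower query function, the outcome fields, and the reward weights) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Sketch of the LLM-as-Follower reward idea behind LaF-GRPO (assumed, simplified).
from statistics import mean, pstdev
from typing import List

def query_follower_llm(instruction: str, scene: dict) -> dict:
    """Hypothetical follower: an LLM role-playing a VI user 'walks' the instruction
    against the scene's ground-truth coordinates and reports the outcome.
    Stubbed out here; in practice this would be an API or local-model call
    returning a parsed judgement such as {"reached_goal": True, "collisions": 0}."""
    raise NotImplementedError

def follower_reward(instruction: str, scene: dict) -> float:
    """Convert the simulated follower's outcome into a scalar reward
    (weights are illustrative)."""
    outcome = query_follower_llm(instruction, scene)
    reward = 1.0 if outcome["reached_goal"] else 0.0
    reward -= 0.5 * outcome["collisions"]  # penalize unsafe guidance
    return reward

def grpo_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style group-relative advantages: normalize each sampled
    instruction's reward by the mean/std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]
```

In this reading, the VLM samples a group of candidate instructions per scene, the follower LLM scores each one, and the normalized advantages drive the GRPO policy update; the exact reward shaping used in LaF-GRPO is described in the method section.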