Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. This study focuses on generating precise, in-situ, step-by-step navigation instructions that are practically usable for VI users. Specifically, we propose LaF-GRPO (LLM-as-Follower GRPO), in which an LLM simulates VI user responses to navigation instructions, thereby providing feedback rewards that guide the post-training of a Vision-Language Model (VLM). This enhances instruction accuracy and usability while reducing the need for costly real-world data collection. To address the scarcity of dedicated benchmarks in this field, we introduce NIG4VI, a 27k-sample open-source dataset for training and evaluation. It comprises diverse navigation scenarios with accurate spatial coordinates, supporting detailed, open-ended, in-situ instruction generation. Experiments on NIG4VI demonstrate the effectiveness of LaF-GRPO on quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU by 14\%; SFT+(LaF-GRPO) reaches METEOR 0.542 vs. 0.323 for GPT-4o), and qualitative analysis further confirms that our method yields more intuitive and safer instructions.
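To make the LaF-GRPO idea concrete, the sketch below illustrates the two pieces the abstract describes: a follower LLM that simulates a VI user's reaction to a candidate instruction and returns a scalar reward, and the group-relative advantage normalization characteristic of GRPO. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`follower_reward`, `grpo_advantages`), the prompt wording, the [0, 1] reward scale, and the `llm` callable are all hypothetical.

```python
# Minimal sketch of an LLM-as-Follower reward inside a GRPO-style loop.
# All names and the prompt format are illustrative assumptions.

from typing import Callable, List


def follower_reward(instruction: str, scene: str, llm: Callable[[str], str]) -> float:
    """Ask a 'follower' LLM to simulate a VI user executing the instruction
    and return a usability/safety score in [0, 1]."""
    prompt = (
        "You are a visually impaired pedestrian in this scene:\n"
        f"{scene}\n"
        f"Navigation instruction: {instruction}\n"
        "Rate how safely and successfully you could follow it, from 0 to 1. "
        "Answer with a single number."
    )
    reply = llm(prompt)  # hypothetical callable returning the model's text
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0  # unparsable reply treated as the lowest reward


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: normalize rewards within one group of
    candidate instructions sampled for the same scene."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

In such a scheme, the VLM would sample a group of candidate instructions per scene, each candidate would be scored by `follower_reward`, and `grpo_advantages` would convert those scores into the relative advantages used for the policy update, so no real-world follower trajectories are needed during post-training.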