While Multimodal Large Language Models (MLLMs) have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps, remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part because collecting large-scale, pixel-accurate path annotations is prohibitively costly and difficult. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. With this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that fine-tuning substantially improves robustness, raising success rates by up to 6.4 points while also reducing path-tracing error (NDTW). These gains demonstrate that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.