Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, from diverse social media images. We introduce GLOBE, Group-relative policy optimization for Localizability assessment and Optimized visual-cue reasoning, yielding Bi-objective geo-Enhancement for the LVLM in both recognition and reasoning. GLOBE incorporates task-specific rewards that jointly enhance localizability assessment, visual-cue reasoning, and geolocation accuracy. Both qualitative and quantitative results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories. The data and code are available at https://github.com/lingli1996/GLOBE.
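The task-specific rewards described above combine a localizability term, a visual-cue reasoning term, and a geolocation-accuracy term. The sketch below illustrates one plausible way such a composite reward could be formed; the weight values, decay constant, and component definitions are illustrative assumptions, not details taken from the paper.

```python
import math

def combined_reward(localizability_score, cue_reasoning_score, geo_distance_km,
                    w_loc=0.3, w_cue=0.3, w_geo=0.4, tau=750.0):
    """Hypothetical composite reward for GRPO-style training (all weights
    and scoring conventions below are assumptions for illustration):

      - localizability_score: 1.0 if the model's localizability judgment
        matches the label, else 0.0
      - cue_reasoning_score: fraction of ground-truth visual cues the
        reasoning trace mentions, in [0, 1]
      - geo_distance_km: great-circle error between predicted and true
        coordinates, mapped to [0, 1] via exponential decay with scale tau
    """
    geo_score = math.exp(-geo_distance_km / tau)  # 1.0 at 0 km, decays with error
    return (w_loc * localizability_score
            + w_cue * cue_reasoning_score
            + w_geo * geo_score)
```

In a group-relative setup, this scalar reward would be computed for each sampled response in a group and then normalized within the group to form the advantage signal; the exponential distance decay keeps the geolocation term smooth rather than thresholded.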