Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models

Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. We introduce GLOBE, Group-relative policy optimization for Localizability assessment and Optimized visual-cue reasoning, yielding Bi-objective geo-Enhancement for the VLM in recognition and reasoning. GLOBE incorporates task-specific rewards that jointly enhance localizability assessment, visual-cue reasoning, and geolocation accuracy. Both qualitative and quantitative results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories. The data and code are available at https://github.com/lingli1996/GLOBE.
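To make the training signal described above more concrete, the following is a minimal sketch of how a group-relative advantage could be computed from a combined reward. It is not the authors' implementation. The `RewardParts` fields, the `combined_reward` function, the equal default weights, and the assumption that each component is a score in [0, 1] are all hypothetical and introduced only for illustration. The normalization step (subtract the group mean, divide by the group standard deviation) follows the general idea of group-relative policy optimization; the population standard deviation is used here as a simplifying assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class RewardParts:
    """Hypothetical per-response reward components, each assumed to lie in [0, 1]."""
    localizability: float  # how well the response judges whether the image can be located
    visual_cue: float      # how well the reasoning is grounded in visual cues
    geo_accuracy: float    # how close the predicted location is to the ground truth


def combined_reward(parts, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three components (the weights are an assumption)."""
    w_loc, w_cue, w_geo = weights
    return (w_loc * parts.localizability
            + w_cue * parts.visual_cue
            + w_geo * parts.geo_accuracy)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses sampled for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population variance over the group of sampled responses.
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    # eps keeps the division defined when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical responses sampled for one image.
    group = [
        RewardParts(localizability=1.0, visual_cue=0.8, geo_accuracy=0.9),
        RewardParts(localizability=1.0, visual_cue=0.4, geo_accuracy=0.2),
        RewardParts(localizability=0.0, visual_cue=0.1, geo_accuracy=0.0),
    ]
    rewards = [combined_reward(p) for p in group]
    for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

In this sketch, a response is favored only in proportion to how much it outscores the other responses in its group, so no separate value model is needed to provide a baseline.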
