Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we present Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further find that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, paving the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at https://github.com/Ricardo-XZ/Think2Seg-RS.
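The decoupled loop described above can be made concrete with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the released implementation: `sam_predict_stub`, `rollout_reward`, and the GRPO/PPO-style training note are placeholders we introduce here, and a real frozen SAM would decode the geometric prompts into a learned mask rather than rasterizing a box.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    # Mask-only reward: IoU between the SAM output and the ground-truth mask.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def sam_predict_stub(image_shape, prompts):
    # Stand-in for the frozen SAM call (hypothetical helper). Here it simply
    # rasterizes the prompted box so the sketch runs end to end; real SAM
    # decodes point/box prompts into a learned mask and, in this framework,
    # receives no gradient updates.
    mask = np.zeros(image_shape, dtype=bool)
    x0, y0, x1, y1 = prompts["box"]
    mask[y0:y1, x0:x1] = True
    return mask

def rollout_reward(image_shape, prompts, gt_mask):
    # One RL rollout: structured geometric prompts -> frozen-SAM mask ->
    # mask-only IoU reward. A policy-gradient method (our assumption:
    # something GRPO/PPO-like) would maximize this reward over sampled
    # prompt rollouts, updating only the LVLM prompter, never SAM.
    pred = sam_predict_stub(image_shape, prompts)
    return mask_iou(pred, gt_mask)

# Toy rollout: ground truth covers a 20x20 region; a partially overlapping
# proposed box yields a fractional reward (225 / 575 ≈ 0.39).
gt = np.zeros((64, 64), dtype=bool)
gt[10:30, 10:30] = True
print(rollout_reward((64, 64), {"box": (15, 15, 35, 35)}, gt))
```

Because the reward is computed only on the final mask, the LVLM prompter is free to choose whatever prompt geometry maximizes segmentation quality, which is exactly the decoupling of semantic reasoning from pixel prediction that the abstract argues for.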