Remote sensing change detection (RSCD) is a complex multi-image inference task traditionally addressed with pixel-based operators or encoder-decoder networks, which inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. The decoder then integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves the recognition and localization of true semantic changes, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
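To make the two-stage dataflow concrete, the following is a minimal PyTorch sketch of the mask-guided decoding step: the VLM reasoner's coarse change mask is concatenated with bi-temporal image features, and a small convolutional head predicts the per-pixel change probability. The `MaskGuidedDecoder` class, its layer sizes, and the tensor shapes are all illustrative assumptions, not the paper's actual MGD implementation.

```python
import torch
import torch.nn as nn

class MaskGuidedDecoder(nn.Module):
    """Illustrative sketch (not the paper's architecture): fuse
    dual-temporal features with a coarse change mask to predict
    a binary change map."""
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        # Input channels: t1 features + t2 features + 1-channel coarse mask.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_dim + 1, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 1, kernel_size=1),
        )

    def forward(self, feat_t1, feat_t2, coarse_mask):
        # Concatenate along the channel dimension, then predict
        # a per-pixel change probability in [0, 1].
        x = torch.cat([feat_t1, feat_t2, coarse_mask], dim=1)
        return torch.sigmoid(self.fuse(x))

# Toy usage with hypothetical shapes: 32-channel features on a 64x64 grid.
f1 = torch.randn(1, 32, 64, 64)       # features of the first-epoch image
f2 = torch.randn(1, 32, 64, 64)       # features of the second-epoch image
coarse = torch.zeros(1, 1, 64, 64)    # coarse mask from the VLM reasoner
change_map = MaskGuidedDecoder()(f1, f2, coarse)
print(change_map.shape)  # torch.Size([1, 1, 64, 64])
```

Thresholding `change_map` (e.g. at 0.5) then yields the final binary change map described in the abstract.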