Remote sensing change detection (RSCD) is a complex multi-image inference task traditionally addressed with pixel-based operators or encoder-decoder networks, which inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. The decoder then integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves the recognition and localization of true semantic changes, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
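To make the two-stage dataflow concrete, the following is a minimal PyTorch sketch of the mask-guided decoding step: the VLM reasoner's coarse change mask is concatenated with bi-temporal image features, and a small convolutional head predicts the per-pixel change probability. The `MaskGuidedDecoder` class, its layer sizes, and the tensor shapes are all illustrative assumptions, not the paper's actual MGD implementation.

```python
import torch
import torch.nn as nn

class MaskGuidedDecoder(nn.Module):
    """Illustrative sketch (not the paper's architecture): fuse
    dual-temporal features with a coarse change mask to predict
    a binary change map."""
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        # Input channels: t1 features + t2 features + 1-channel coarse mask.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_dim + 1, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 1, kernel_size=1),
        )

    def forward(self, feat_t1, feat_t2, coarse_mask):
        # Concatenate along the channel dimension, then predict
        # a per-pixel change probability in [0, 1].
        x = torch.cat([feat_t1, feat_t2, coarse_mask], dim=1)
        return torch.sigmoid(self.fuse(x))

# Toy usage with hypothetical shapes: 32-channel features on a 64x64 grid.
f1 = torch.randn(1, 32, 64, 64)       # features of the first-epoch image
f2 = torch.randn(1, 32, 64, 64)       # features of the second-epoch image
coarse = torch.zeros(1, 1, 64, 64)    # coarse mask from the VLM reasoner
change_map = MaskGuidedDecoder()(f1, f2, coarse)
print(change_map.shape)  # torch.Size([1, 1, 64, 64])
```

Thresholding `change_map` (e.g. at 0.5) then yields the final binary change map described in the abstract.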