Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes between bi-temporal remote sensing images. A straightforward approach is to adapt generic vision-language models to CDVQA via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures stem not from clearly incorrect predictions but from decision ambiguity, where the model assigns similar confidence to the correct answer and to strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing on DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS with an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
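One plausible formalization of the DAS margin, in our own notation rather than symbols fixed by the text above ($p_{\theta_{\mathrm{ref}}}$ denotes the SFT-trained reference policy, $y^{*}$ the ground-truth answer, and $\tau$ an assumed ambiguity threshold):

\[ m(x) = p_{\theta_{\mathrm{ref}}}(y^{*} \mid x) - \max_{y \neq y^{*}} p_{\theta_{\mathrm{ref}}}(y \mid x), \qquad \mathrm{DAS} = \{\, x : |m(x)| < \tau \,\} \]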
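A minimal sketch of the two-stage pipeline (mine DAS under the reference policy, then compute GRPO-style intra-group relative advantages over multi-sample decodes), assuming a hypothetical `ref_policy.answer_probs` interface, a `gt` field on each example, and a threshold `TAU`; none of these names come from the text above:

```python
import statistics

TAU = 0.1  # assumed margin threshold for decision ambiguity

def margin(probs: dict[str, float], gt: str) -> float:
    """Probability margin between the ground-truth answer and the
    strongest competing answer under the reference policy."""
    top_distractor = max(p for a, p in probs.items() if a != gt)
    return probs[gt] - top_distractor

def mine_das(dataset, ref_policy):
    """Stage 1: keep instances whose margin under the SFT-trained
    reference policy is small (decision-ambiguous samples)."""
    return [ex for ex in dataset
            if abs(margin(ref_policy.answer_probs(ex), ex.gt)) < TAU]

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Stage 2 (GRPO-style): normalize each sampled answer's reward
    against the group mean and std to get intra-group relative advantages."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]
```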