Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for the wrong object or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucinations and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), in which a model must segment the referenced object in the factual image and abstain on its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm that RobustSeg reduces hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g).
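To make the CSR protocol concrete, the following is a minimal Python sketch of how a single factual/counterfactual pair could be scored. The function names (iou, csr_scores) and the hallucination_severity and abstained fields are illustrative assumptions for this sketch, not HalluSegBench's official metric definitions.

import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 1.0

def csr_scores(pred_factual: np.ndarray,
               gt_factual: np.ndarray,
               pred_counterfactual: np.ndarray) -> dict:
    """Score one CSR pair (illustrative, not the benchmark's official metrics).

    - factual_iou: mask quality when the referenced object is present.
    - hallucination_severity: fraction of pixels segmented in the
      counterfactual image, where the correct behavior is to abstain
      (predict an empty mask).
    - abstained: whether the model produced an empty mask on the
      counterfactual image.
    """
    return {
        "factual_iou": iou(pred_factual, gt_factual),
        "hallucination_severity": float(pred_counterfactual.mean()),
        "abstained": bool(pred_counterfactual.sum() == 0),
    }

# Toy example: the model segments correctly in the factual image but
# hallucinates a small region in the counterfactual one.
gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
pred_f = gt.copy()
pred_cf = np.zeros((8, 8), dtype=bool); pred_cf[0:2, 0:2] = True
print(csr_scores(pred_f, gt, pred_cf))

In this toy case the factual IoU is 1.0 while the counterfactual prediction covers 4 of 64 pixels, so the pair would be flagged as a vision-driven hallucination with a small but nonzero severity.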