Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim tokens, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. By exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
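As an illustration of how such a signal could be computed on the fly, the following is a minimal sketch that assumes a PAS-style score is the share of attention mass placed on prelim (previously generated) tokens at the decoding step that emits the object token, averaged over heads and layers. The function name, tensor shapes, and aggregation scheme are illustrative assumptions, not the paper's exact definition.

```python
# Hypothetical sketch of a Prelim Attention Score (PAS)-style signal.
# Assumption: the score is the attention mass on prelim (previously generated
# output) tokens at the step predicting a new object, averaged over heads and
# layers. The paper's exact aggregation may differ.

import torch


def prelim_attention_score(
    attentions: list[torch.Tensor],
    prelim_mask: torch.Tensor,
) -> float:
    """Attention-mass ratio over prelim tokens for the current decoding step.

    attentions: per-layer post-softmax attention rows for the step emitting
        the object token, each of shape (num_heads, seq_len).
    prelim_mask: boolean mask of shape (seq_len,) marking prelim tokens.
    """
    per_layer = []
    for layer_attn in attentions:
        # Attention mass on prelim tokens per head, then averaged over heads.
        prelim_mass = layer_attn[:, prelim_mask].sum(dim=-1)  # (num_heads,)
        per_layer.append(prelim_mass.mean())
    # Average over layers; a high value indicates the prediction leans on
    # prelim tokens rather than the image.
    return torch.stack(per_layer).mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    num_layers, num_heads, seq_len = 4, 8, 32
    # Fake post-softmax attention rows for the current decoding step.
    attns = [
        torch.softmax(torch.randn(num_heads, seq_len), dim=-1)
        for _ in range(num_layers)
    ]
    # Suppose the last 10 positions are prelim (already generated) tokens.
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[-10:] = True
    print(f"PAS-style score: {prelim_attention_score(attns, mask):.3f}")
```

Because the score reuses attention weights already produced during generation, it adds no extra forward passes, consistent with the real-time filtering use case described above.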