Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between the generated text and the input image to mitigate hallucinations. Unlike existing methods that focus solely on text-token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens that remain maximally relevant to the given image, while simultaneously refining the image tokens most pertinent to the generated response. Extensive experiments across various benchmarks demonstrate that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
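To make the scoring rule concrete: a standard instantiation of conditional PMI at decoding step t is C-PMI(y_t; v | x, y_<t) = log p(y_t | v, x, y_<t) - log p(y_t | x, y_<t), i.e., how much the image v raises the probability of token y_t beyond the language prior. The sketch below illustrates a C-PMI-calibrated decoding step in this spirit; it is not the paper's exact algorithm. The `model` interface, the weight `lam`, and greedy selection are assumptions for illustration, and the second level of the bi-level problem (refining image tokens against the generated response) is omitted.

```python
import torch
import torch.nn.functional as F

def cpmi_calibrated_step(model, image_tokens, text_tokens, lam=0.5):
    """One greedy decoding step calibrated by C-PMI (illustrative sketch).

    Assumed interface: `model(ids)` returns next-token logits of shape
    [batch, seq_len, vocab]; calling it without `image_tokens` yields the
    image-free (language-prior) distribution. `lam` is a hypothetical
    weight on the C-PMI term.
    """
    # log p(y_t | v, x, y_<t): distribution conditioned on the image
    logits_vis = model(torch.cat([image_tokens, text_tokens], dim=1))[:, -1]
    logp_vis = F.log_softmax(logits_vis, dim=-1)

    # log p(y_t | x, y_<t): image dropped, exposing the language prior
    logits_txt = model(text_tokens)[:, -1]
    logp_txt = F.log_softmax(logits_txt, dim=-1)

    # C-PMI(y_t; v | x, y_<t) = log p(y_t | v, x, y_<t) - log p(y_t | x, y_<t)
    cpmi = logp_vis - logp_txt

    # Calibrated score: prefer tokens whose probability is supported by
    # the image rather than by the language prior alone
    score = logp_vis + lam * cpmi
    return score.argmax(dim=-1)
```

Because the calibration is a per-step logit adjustment reusing the same model, a scheme like this adds at most one extra forward pass per step, consistent with the abstract's claim that decoding efficiency is preserved.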