Large Vision-Language Models (LVLMs) have shown remarkable capabilities, yet hallucination remains a persistent challenge. This work presents a systematic analysis of how visual perception and token generation evolve inside LVLMs, revealing two key patterns. First, perception follows a three-stage GATE process: early layers perform a Global scan, intermediate layers Approach and Tighten around the core content, and later layers Explore supplementary regions. Second, generation exhibits a SAD (Subdominant Accumulation to Dominant) pattern, in which hallucinated tokens arise from the repeated accumulation of subdominant tokens that lack support from either the attention module (visual perception) or the feed-forward network (internal knowledge). Guided by these findings, we devise the VDC (Validated Dominance Correction) strategy, which detects unsupported tokens and replaces them with validated dominant ones to improve output reliability. Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations.
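To make the detect-and-replace idea behind VDC concrete, here is a minimal sketch of one decoding step in Python. It is not the paper's actual formulation: the assumption that per-branch logits (attn_logits, ffn_logits) can be obtained by projecting the attention and FFN contributions through the unembedding matrix (logit-lens style), the top-k support criterion, and the function name vdc_step are all hypothetical.

```python
import numpy as np


def vdc_step(final_logits, attn_logits, ffn_logits, top_k=10):
    """One decoding step of a VDC-style correction (hypothetical sketch).

    A candidate token counts as "supported" if it ranks in the top_k of
    either the attention-branch logits (visual perception) or the
    FFN-branch logits (internal knowledge). If the model's argmax token
    is unsupported, fall back to the highest-scoring supported token.
    """
    attn_top = set(np.argsort(attn_logits)[-top_k:])
    ffn_top = set(np.argsort(ffn_logits)[-top_k:])
    supported = attn_top | ffn_top

    candidate = int(np.argmax(final_logits))
    if candidate in supported:
        return candidate  # dominant token is already validated

    # Replace with the best-scoring token that has branch-level support.
    for tok in np.argsort(final_logits)[::-1]:
        if int(tok) in supported:
            return int(tok)
    return candidate  # no supported alternative exists; keep the original


if __name__ == "__main__":
    # Toy example over a 32-token vocabulary with random branch logits.
    rng = np.random.default_rng(0)
    vocab = 32
    final = rng.normal(size=vocab)
    attn = rng.normal(size=vocab)
    ffn = rng.normal(size=vocab)
    print("corrected token id:", vdc_step(final, attn, ffn))
```

Under this reading, a hallucinated token is one that wins the final softmax only through accumulated subdominant mass while appearing near the top of neither branch, so the correction simply demotes it in favor of the best branch-validated alternative.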