Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations: text that is inconsistent with the visual input, largely because they have limited ability to verify that information is consistent across different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions via cross-attention, generates an initial response for each region, and computes reliability weights from the Jensen-Shannon Divergence (JSD) among these responses. These weights then guide a consistency-aware fusion of the per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves the factuality of responses without requiring any model updates.
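To make the weighting step concrete, below is a minimal NumPy/SciPy sketch of one plausible reading of the JSD-based reliability weighting and fusion: each region's next-token distribution is scored by its average divergence from the other regions, and more consistent regions receive higher weight. The function names, the softmax temperature `tau`, and the mixture-style fusion are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_weights(region_dists, tau=1.0):
    """Reliability weights from inter-region agreement (illustrative sketch).

    region_dists: (R, V) array; row i is region i's next-token distribution.
    Returns (R,) weights summing to 1. Regions whose distribution diverges
    more from the others (higher mean JSD) receive lower weight.
    """
    R = region_dists.shape[0]
    mean_jsd = np.zeros(R)
    for i in range(R):
        # SciPy's jensenshannon returns the JS *distance* (the square root
        # of the divergence), so square it to recover the JSD itself.
        divs = [jensenshannon(region_dists[i], region_dists[j]) ** 2
                for j in range(R) if j != i]
        mean_jsd[i] = np.mean(divs)
    # Softmax over negative divergence: more consistent -> higher weight.
    logits = -mean_jsd / tau
    w = np.exp(logits - logits.max())
    return w / w.sum()

def fuse(region_dists, weights):
    """Consistency-aware fusion as a weighted mixture of per-region distributions."""
    fused = weights @ region_dists  # (R,) @ (R, V) -> (V,)
    return fused / fused.sum()
```

For example, if two regions produce near-identical distributions while a third disagrees sharply, the third region's mean JSD is largest and its weight is suppressed, so the fused distribution is dominated by the mutually consistent regions.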