MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs

Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations, that is, text that is inconsistent with the visual input, because they have limited ability to verify information across different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates an initial response for each region, and computes reliability weights from the Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of the per-region predictions, which are produced with region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
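
To make the weighting step concrete, the following is a minimal sketch of how JSD-based reliability weights and a weighted fusion could be computed. The abstract does not give the exact formulas, so several choices here are assumptions for illustration only: each region's response is represented by a toy probability distribution over a small vocabulary, the weights come from a softmax over the negative mean pairwise JSD, and fusion is a simple weighted mixture. The function names and the `temperature` parameter are hypothetical, not taken from MRFD.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def reliability_weights(dists, temperature=1.0):
    """Assumed weighting rule: a region that disagrees with the others
    (high mean JSD) receives a lower weight. Requires at least 2 regions."""
    n = len(dists)
    mean_jsd = [
        sum(jsd(dists[i], dists[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    scores = [math.exp(-d / temperature) for d in mean_jsd]
    total = sum(scores)
    return [s / total for s in scores]


def fuse(dists, weights):
    """Assumed fusion rule: weighted mixture of per-region distributions."""
    vocab_size = len(dists[0])
    return [
        sum(w * d[k] for w, d in zip(weights, dists))
        for k in range(vocab_size)
    ]


# Toy example: regions 0 and 1 agree, region 2 is an outlier.
region_dists = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
weights = reliability_weights(region_dists)
fused = fuse(region_dists, weights)
```

In this toy setup the outlier region receives the smallest weight, so the fused distribution leans toward the two regions that agree with each other. A lower `temperature` would sharpen that preference; the value actually used by the method, if any, is not stated in the abstract.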
