Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucinations in object recognition tasks: they often fail to identify certain objects accurately, producing text that reads fluently but does not correspond to the visual content, which can have serious consequences in real-world applications. Several methods have recently been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce Hallucination Disentangled Decoding (HDD), a training-free method. HDD enhances the original image by segmenting it and selecting the segments that complement the original, while also using a blank image to eliminate language-prior hallucinations in both the original and segmented images. This design not only reduces the model's dependence on language priors but also improves its visual perception. (Code: https://github.com/rickeyhhh/Hallucination-Disentangled-Decoding)
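
To make the decoding idea concrete, the sketch below shows one possible reading of a single HDD decoding step: next-token logits conditioned on the original image and on its selected segments are combined, and logits obtained with a blank image (a proxy for the pure language prior) are subtracted. The `model(images=..., input_ids=...)` interface, the simple averaging over segments, and the weights `alpha`/`beta` are illustrative assumptions, not the paper's published formulation; see the repository above for the actual implementation.

```python
import torch

def hdd_decode_step(model, text_ids, original_image, segment_images, blank_image,
                    alpha=1.0, beta=1.0):
    """One decoding step in the spirit of HDD (a minimal sketch, assuming a
    LLaVA-style model callable as model(images=..., input_ids=...)).

    The combination rule and weights are assumptions for illustration only.
    """
    # Logits conditioned on the full original image.
    logits_orig = model(images=original_image, input_ids=text_ids).logits[:, -1, :]

    # Logits conditioned on each selected segment, averaged as a simple choice.
    seg_logits = [model(images=seg, input_ids=text_ids).logits[:, -1, :]
                  for seg in segment_images]
    logits_seg = torch.stack(seg_logits).mean(dim=0)

    # Logits with a blank image approximate the model's pure language prior.
    logits_blank = model(images=blank_image, input_ids=text_ids).logits[:, -1, :]

    # Contrastive combination: reinforce visual evidence from the original
    # and segmented views, then subtract the language-prior contribution.
    combined = logits_orig + alpha * logits_seg - beta * logits_blank
    return torch.softmax(combined, dim=-1)
```

In this sketch, the blank-image branch plays the same role as the language-prior term the abstract describes: whatever the model would predict without any visual evidence is down-weighted, so tokens supported by the original image and its segments are favored during generation.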