Large-scale vision-language pre-trained (VLP) models are prone to hallucinate non-existent visual objects when generating text based on visual information. In this paper, we systematically study the object hallucination problem from three aspects. First, we examine recent state-of-the-art VLP models, showing that they still hallucinate frequently, and that models achieving better scores on standard metrics (e.g., CIDEr) can be more unfaithful. Second, we investigate how different types of image encoding in VLP influence hallucination, covering region-based, grid-based, and patch-based features. Surprisingly, we find that patch-based features perform best, and that a smaller patch resolution yields a non-trivial reduction in object hallucination. Third, we decouple various VLP objectives and demonstrate that token-level image-text alignment and controlled generation are crucial to reducing hallucination. Based on these findings, we propose a simple yet effective VLP loss named ObjMLM to further mitigate object hallucination. Results show that it reduces object hallucination by up to 17.4% when tested on two benchmarks (COCO Caption for in-domain and NoCaps for out-of-domain evaluation).
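The abstract names ObjMLM only at a high level: a masked-language-modeling loss that targets object tokens to strengthen token-level image-text alignment. The snippet below is a minimal sketch of one plausible masking step under that reading, not the authors' implementation; `tokenizer`, `object_vocab`, and `mask_token_id` are assumed helpers (e.g., a HuggingFace-style tokenizer and a detector-derived object vocabulary).

```python
# Hypothetical sketch of ObjMLM-style masking (assumptions, not the paper's code).
import torch

def objmlm_mask(caption, tokenizer, object_vocab, mask_token_id):
    """Mask only tokens that name visual objects; MLM loss is then computed
    on those positions alone (-100 labels elsewhere, PyTorch's convention
    for positions ignored by CrossEntropyLoss)."""
    enc = tokenizer(caption, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
    for i, tok in enumerate(tokens):
        # Strip BERT-style "##" and GPT-2-style "Ġ" subword markers
        # before checking membership in the object vocabulary.
        if tok.lstrip("#Ġ").lower() in object_vocab:
            labels[0, i] = input_ids[0, i]
            input_ids[0, i] = mask_token_id
    return input_ids, labels
```

During pre-training, the masked sequence would be fed to the model together with the image features, and cross-entropy would be computed only on the object positions, encouraging the model to ground object words in the image rather than in language priors.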