Recently, large vision-language models (LVLMs) have emerged as a promising approach for multimodal tasks. However, principled hallucination mitigation remains a critical challenge. In this work, we first analyze the data generation process in LVLM hallucination mitigation and confirm that on-policy data significantly outperforms off-policy data, which calls for efficient and reliable preference annotation of on-policy data. We then point out that existing annotation methods introduce additional hallucinations into training samples, which may reinforce the model's hallucination patterns. To address this problem, we propose training a hallucination classifier that provides binary annotations, guaranteeing clean chosen samples for the subsequent alignment. To further harness the power of on-policy data, we design a robust iterative direct preference optimization (DPO) algorithm that adopts a dynamic sample reweighting scheme. We conduct comprehensive experiments on three benchmarks against 8 state-of-the-art baselines. In particular, our approach reduces the hallucination rate of LLaVA-1.5-7B on MMHalBench by 50.8% and the average hallucination rate on Object HalBench by 79.5%; more notably, our method fully taps the potential of open-source models, enabling LLaVA-1.5-13B to even surpass GPT-4V.
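For readers unfamiliar with the objective referenced above, the following is a minimal sketch (not the authors' released code) of a DPO loss with per-sample weights, illustrating the general form a dynamic sample reweighting scheme could take; the weight values and the function name here are placeholder assumptions, and the paper's actual reweighting rule may differ.

```python
import torch
import torch.nn.functional as F


def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      sample_weights, beta=0.1):
    """Per-sample weighted DPO loss (illustrative sketch).

    All *_logps are length-B tensors of summed token log-probabilities of the
    chosen / rejected responses under the policy and the frozen reference
    model; sample_weights is a length-B tensor of non-negative weights
    (hypothetically derived, e.g., from a hallucination classifier's score).
    """
    # Implicit reward margins under the standard DPO formulation.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    logits = chosen_rewards - rejected_rewards
    # Standard DPO objective, scaled per sample before averaging.
    losses = -F.logsigmoid(logits) * sample_weights
    return losses.mean()


if __name__ == "__main__":
    # Toy usage with random log-probabilities for a batch of 4 preference pairs.
    B = 4
    rand = lambda: torch.randn(B)
    weights = torch.tensor([1.0, 0.5, 1.0, 0.8])  # hypothetical dynamic weights
    print(weighted_dpo_loss(rand(), rand(), rand(), rand(), weights).item())
```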