Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation stems from the intrinsic data imbalance between text and video, and it is challenging to address because collecting and annotating counterfactual data is costly. To overcome this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Building on this framework, we construct DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime whose RL phase applies pair-wise $\ell_1$ advantage normalization, enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains on both hallucination and general-purpose benchmarks, indicating strong generalization. We will open-source our dataset and code.
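For concreteness, one plausible instantiation of the pair-wise $\ell_1$ advantage normalization is sketched below, assuming a GRPO-style setup in which $K$ responses are sampled for each video of an original-edited pair and assigned rewards $r_1, \dots, r_{2K}$; the notation ($K$, $r_i$, $\epsilon$) is ours, and the exact formulation may differ in the paper:
\[
\hat{A}_i = \frac{r_i - \bar{r}}{\tfrac{1}{2K}\sum_{j=1}^{2K}\lvert r_j - \bar{r}\rvert + \epsilon}, \qquad \bar{r} = \frac{1}{2K}\sum_{j=1}^{2K} r_j .
\]
That is, advantages are centered over the pair's joint rollout group and scaled by the group's mean absolute deviation (an $\ell_1$ analogue of the usual standard-deviation scaling), so the contrastive signal from the original and the edited video is normalized jointly rather than per video.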