Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, i.e., generating text that is not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy of four error types: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address them, we introduce the AHA (Audio Hallucination Alignment) framework. Leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. We also establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. Applying this data to align Qwen2.5-Omni yields Qwen-Audio-AHA, which achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set: the model also shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming the latest SOTA methods.
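To make the counterfactual hard negative mining concrete, here is a minimal Python sketch of how preference pairs could be built from timestamped event annotations, with one perturbation per taxonomy category. Every name here (`DISTRACTORS`, `perturb`, the event-tuple format) is a hypothetical illustration of the idea under assumed annotations, not the paper's actual pipeline.

```python
import copy
import random

# Hypothetical label->distractor map for False Event Identity edits.
DISTRACTORS = {"dog_bark": "cat_meow", "car_horn": "siren", "speech": "singing"}

def describe(events):
    """Render a caption from (label, onset_sec, offset_sec) tuples."""
    return "; ".join(f"{lbl} from {on:.1f}s to {off:.1f}s" for lbl, on, off in events)

def perturb(events):
    """Apply one counterfactual edit matching a hallucination category."""
    events = copy.deepcopy(events)
    kind = random.choice(["omission", "identity", "order", "duration"])
    if kind == "omission" and len(events) > 1:          # Event Omission
        events.pop(random.randrange(len(events)))
    elif kind == "identity":                            # False Event Identity
        i = random.randrange(len(events))
        lbl, on, off = events[i]
        events[i] = (DISTRACTORS.get(lbl, "noise"), on, off)
    elif kind == "order" and len(events) > 1:           # Temporal Relation Error
        i, j = random.sample(range(len(events)), 2)
        (li, oni, offi), (lj, onj, offj) = events[i], events[j]
        events[i], events[j] = (li, onj, offj), (lj, oni, offi)  # swap time spans
    else:                                               # Quantitative Temporal Error
        i = random.randrange(len(events))
        lbl, on, off = events[i]
        events[i] = (lbl, on, off + random.uniform(2.0, 5.0))  # inflate duration
    return events

def make_preference_pair(events):
    """Chosen = grounded caption; rejected = linguistically plausible counterfactual."""
    return {"chosen": describe(events), "rejected": describe(perturb(events))}

if __name__ == "__main__":
    clip = [("dog_bark", 0.5, 1.2), ("speech", 1.5, 4.0), ("car_horn", 4.2, 4.8)]
    print(make_preference_pair(clip))
```

Because the rejected caption differs from the chosen one by a single targeted edit while staying fluent, pairs of this shape would serve directly as hard negatives for preference optimization (e.g., DPO-style alignment).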