Audio deep reasoning is a challenging task that requires expert-level perception, multi-step logical inference, and the integration of contextual knowledge. However, existing models suffer from a gap between audio perception and reasoning abilities, owing to the lack of training data with explicit reasoning chains and the absence of mechanisms for active exploration and iterative refinement. To address these challenges, we propose AudioGenie-Reasoner (AGR), the first unified training-free multi-agent system that coordinates perception and reasoning over an evolving chain of textual evidence. Our key idea is a paradigm shift that recasts audio deep reasoning as a complex text understanding task, thereby unlocking the full potential of large language models. Specifically, the design of AGR mimics the human coarse-to-fine cognitive process. It first transforms the input audio into a coarse text-based document. Then, a novel proactive iterative document refinement loop, featuring tool-augmented routes and specialized agents, continuously searches for missing information and augments the evidence chain in a coarse-to-fine manner until sufficient question-related information is gathered to make the final prediction. Experimental results show that AGR achieves state-of-the-art (SOTA) performance, surpassing existing open-source audio deep reasoning models across various benchmarks. The code will be available at https://github.com/ryysayhi/AudioGenie-Reasoner.
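To make the coarse-to-fine refinement loop concrete, the following is a minimal Python sketch of the control flow described above. All function names (coarse_transcribe, find_missing_information, probe_audio, answer_from_text) and their stub bodies are hypothetical placeholders introduced here for illustration; in the actual system they would correspond to LLM-based agents and tool-augmented audio routes. This is a sketch of the idea, not the paper's implementation.

```python
"""Illustrative sketch of a proactive iterative document refinement loop.

The functions below are hypothetical stand-ins (not AGR's actual API);
they mark where specialized agents and audio tools would plug in.
"""

def coarse_transcribe(audio: str) -> str:
    # Placeholder: convert the input audio into a coarse text-based document.
    return f"[coarse transcript of {audio}]"

def find_missing_information(document: str, question: str) -> str | None:
    # Placeholder reasoning agent: describe missing, question-related
    # information, or return None if the evidence chain is sufficient.
    return None

def probe_audio(audio: str, gap: str) -> str:
    # Placeholder perception agent: route the gap to an audio tool
    # (e.g., ASR or event detection) and return new textual evidence.
    return f"[evidence about: {gap}]"

def answer_from_text(document: str, question: str) -> str:
    # Placeholder: make the final prediction from the refined document,
    # i.e., audio reasoning reduced to text understanding.
    return "[answer]"

def audio_deep_reasoning(audio: str, question: str, max_rounds: int = 5) -> str:
    document = coarse_transcribe(audio)              # coarse document
    for _ in range(max_rounds):                      # proactive refinement loop
        gap = find_missing_information(document, question)
        if gap is None:                              # enough evidence gathered
            break
        document += "\n" + probe_audio(audio, gap)   # augment the evidence chain
    return answer_from_text(document, question)

if __name__ == "__main__":
    print(audio_deep_reasoning("example.wav", "What event follows the siren?"))
```

The key design choice this sketch highlights is that once the evidence chain is textual, the final prediction step needs no audio access at all, which is what allows a text-only large language model to carry the reasoning.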