Recognizing speaker intent in long, multi-speaker audio dialogues has a wide range of applications, but it is a non-trivial AI task due to complex inter-dependencies among speaker utterances and scarce annotated data. To address these challenges, this work proposes an end-to-end framework, DialogGraph-LLM. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. To exploit unlabeled audio, an adaptive LLM-based semi-supervised learning strategy is designed, featuring a confidence-aware pseudo-label generation mechanism with dual-threshold filtering over both global and per-class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM's superiority over strong audio- and text-driven baselines. The framework achieves strong performance and efficiency in intent recognition on real-world audio dialogues, demonstrating its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.
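The following is a minimal, illustrative sketch of the semi-supervised selection step described above, not the authors' released implementation: pseudo-labels pass a dual-threshold filter on both a global confidence and a per-class confidence, and an entropy score is then used to prioritize the most informative unlabeled samples. The function name, threshold values, tensor shapes, and the `top_k` budget are assumptions for illustration only.

```python
# Hypothetical sketch of confidence-aware pseudo-label selection with
# dual-threshold filtering and entropy-based prioritization (not the
# official DialogGraph-LLM code; thresholds and shapes are assumed).
import torch
import torch.nn.functional as F

def select_pseudo_labels(logits, global_tau=0.90, class_tau=None, top_k=256):
    """logits: [N, C] model outputs on unlabeled dialogue samples."""
    probs = F.softmax(logits, dim=-1)           # [N, C] class probabilities
    conf, pseudo = probs.max(dim=-1)            # per-sample confidence and predicted label

    # Dual-threshold filter: a sample is accepted only if its confidence
    # exceeds both the global threshold and the threshold of its predicted class.
    if class_tau is None:                       # assumed default: uniform per-class thresholds
        class_tau = torch.full((probs.size(1),), 0.85)
    keep = (conf >= global_tau) & (conf >= class_tau[pseudo])

    # Entropy-based selection: among accepted samples, prioritize those with
    # higher predictive entropy (more informative), keeping at most top_k.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    idx = torch.nonzero(keep, as_tuple=False).squeeze(-1)
    if idx.numel() > top_k:
        idx = idx[entropy[idx].topk(top_k).indices]
    return idx, pseudo[idx]                     # selected indices and their pseudo-labels
```

In this reading, the filtered and prioritized samples would be added to the training pool with their pseudo-labels for the next semi-supervised round; the exact interaction between the confidence filter and the entropy criterion in DialogGraph-LLM may differ from this sketch.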