Extending Audio Context for Long-Form Understanding in Large Audio-Language Models

Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, which limits long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) for unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions and leaves text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and that the VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.
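The idea of rescaling only the audio positions can be illustrated with a small sketch. This is not the authors' implementation: it substitutes plain linear position interpolation for YaRN's frequency-dependent scaling, and the function names, the `audio_window` parameter, and the `factors` tuple are all hypothetical. It assumes a sequence laid out as text prefix, audio tokens, text suffix. The second function shows one possible reading of the VLAT idea, where a compression factor is sampled at each training step so the model sees audio positions at several spacings.

```python
import random


def partial_position_ids(n_prefix, n_audio, n_suffix, audio_window):
    """Positions for a [text prefix][audio][text suffix] sequence.

    Assumption: audio positions are linearly compressed to fit into
    `audio_window` slots when n_audio exceeds it. Text positions keep
    unit spacing, so the text side looks the same as in the base model.
    """
    prefix = [float(i) for i in range(n_prefix)]
    span = min(n_audio, audio_window)            # slots the audio may occupy
    scale = span / n_audio if n_audio > 0 else 1.0
    audio = [n_prefix + i * scale for i in range(n_audio)]
    suffix_start = n_prefix + span               # text resumes right after the audio span
    suffix = [float(suffix_start + i) for i in range(n_suffix)]
    return prefix + audio + suffix


def virtual_length_positions(n_prefix, n_audio, n_suffix,
                             factors=(1, 2, 4, 8), rng=random):
    """Hypothetical training-time augmentation in the spirit of VLAT.

    Samples a factor and compresses the audio positions as if the clip
    were `factor` times longer than the window it must fit into.
    """
    factor = rng.choice(factors)
    return partial_position_ids(n_prefix, n_audio, n_suffix, n_audio / factor)


# 8 audio tokens squeezed into a 4-slot window between 2+2 text tokens.
print(partial_position_ids(2, 8, 2, 4))
```

In this sketch the text suffix always starts one window after the prefix ends, regardless of how many audio tokens were compressed into that window, which is the property the abstract describes as leaving text positions intact.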

