Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) for unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and the VLAT training strategy provides substantial improvements, achieving strong performance on long audio of unseen lengths.
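To make the positional mechanics concrete, the following is a minimal sketch of the two ideas under our own assumptions. The function names, the linear position-interpolation form, and the sampling range are illustrative placeholders, not the paper's implementation; Partial YaRN is expressed here as a remapping of position ids that compresses only the audio span, and VLAT as random sampling of the interpolation factor during training.

```python
import random
import torch

def partial_yarn_position_ids(text_prefix_len: int,
                              num_audio_tokens: int,
                              text_suffix_len: int,
                              trained_audio_capacity: int) -> torch.Tensor:
    """Illustrative Partial YaRN-style remapping (not the paper's code).

    Audio token positions are interpolated so a long audio span fits within
    the position range seen in training, while text tokens keep unit spacing,
    leaving the base LLM's text positions untouched.
    """
    # Text prefix keeps ordinary integer positions: 0, 1, ..., P-1.
    prefix = torch.arange(text_prefix_len, dtype=torch.float32)

    # Interpolation factor: how much longer the audio is than the capacity
    # the model was trained with (no change if it already fits).
    scale = max(1.0, num_audio_tokens / trained_audio_capacity)

    # Audio positions advance by 1/scale, so the whole span occupies at most
    # `trained_audio_capacity` position units.
    audio = text_prefix_len + torch.arange(num_audio_tokens,
                                           dtype=torch.float32) / scale

    # Text suffix resumes unit spacing right after the compressed audio span.
    suffix_start = text_prefix_len + num_audio_tokens / scale
    suffix = suffix_start + torch.arange(text_suffix_len, dtype=torch.float32)

    # These (possibly fractional) position ids would then feed the model's
    # RoPE computation in place of the default 0..N-1 ids.
    return torch.cat([prefix, audio, suffix])

def vlat_virtual_scale(max_virtual_factor: float = 4.0) -> float:
    """Illustrative VLAT-style augmentation: sample a virtual interpolation
    factor per training example, so the model sees diverse effective audio
    lengths even when the actual training clips are short."""
    return random.uniform(1.0, max_virtual_factor)
```

In this reading, inference-time Partial YaRN applies one fixed scale determined by the input length, while VLAT varies that scale randomly during training so the model generalizes to interpolation factors (and hence audio lengths) it never saw.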