Spoken Language Models (SLMs) are increasingly central to modern speech-driven applications, but their performance degrades under acoustic shift: real-world noise, reverberation, and microphone variation. Prior solutions rely on offline domain adaptation, which is post hoc, data-intensive, and slow. We introduce the first test-time adaptation (TTA) framework for generative SLMs that process interleaved audio-text prompts. Our method updates a small, targeted subset of parameters during inference using only the incoming utterance, requiring no source data or labels. This adaptation stabilizes token distributions and improves robustness to acoustic variability without degrading core task accuracy. Evaluated on automatic speech recognition, speech translation, and 19 audio understanding tasks from AIR-Bench, our approach yields consistent gains under diverse corruptions. Because adaptation touches only a small fraction of the weights, it is both compute- and memory-efficient, supporting deployment on resource-constrained platforms. This work enhances the robustness and adaptability of generative SLMs for real-world speech-driven applications.
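To make the adaptation mechanism concrete, below is a minimal sketch of one plausible instantiation: unsupervised entropy minimization over the model's next-token distribution, updating only LayerNorm affine parameters (a small, targeted subset, as in Tent-style TTA). The abstract does not specify the actual objective or parameter subset, and the model interface here (`input_ids`, `audio_features`, a `.logits` output) is a hypothetical PyTorch-style API, so treat this as an illustration rather than the paper's method.

```python
import torch
import torch.nn.functional as F


def collect_adaptable_params(model):
    """Freeze everything, then re-enable gradients only for LayerNorm
    affine parameters -- a small, targeted subset (an assumption; the
    paper does not state which weights it updates)."""
    for p in model.parameters():
        p.requires_grad_(False)
    params = []
    for module in model.modules():
        if isinstance(module, torch.nn.LayerNorm):
            for p in module.parameters():
                p.requires_grad_(True)
                params.append(p)
    return params


@torch.enable_grad()
def adapt_on_utterance(model, input_ids, audio_features, steps=1, lr=1e-4):
    """One adaptation pass on a single incoming utterance, with no source
    data or labels. Objective: entropy minimization over the next-token
    distribution, which pushes the model toward confident, stable token
    predictions under acoustic shift."""
    model.eval()  # keep dropout and normalization statistics frozen
    params = collect_adaptable_params(model)
    optimizer = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        # Hypothetical interleaved audio-text forward pass.
        logits = model(input_ids=input_ids, audio_features=audio_features).logits
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    return model
```

Because gradients flow only through the LayerNorm parameters in this sketch, the per-utterance update is cheap in both compute and memory, which is consistent with the abstract's claim of deployability on resource-constrained platforms.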