Large Audio-Language Models (LALMs) are becoming essential as a powerful multimodal backbone for real-world applications. However, recent studies show that audio inputs can elicit harmful responses more easily than text, exposing new risks for deployment. While safety alignment has made initial advances in LLMs and Large Vision-Language Models (LVLMs), we find that vanilla adaptation of these approaches to LALMs faces two key limitations: 1) LLM-based steering fails under audio input due to the large distributional gap between text and audio activations, and 2) prompt-based defenses induce over-refusals on benign speech queries. To address these challenges, we propose Safe-Ablated Refusal Steering (SARSteer), the first inference-time defense framework for LALMs. Specifically, SARSteer leverages text-derived refusal steering to enforce rejection without manipulating audio inputs and introduces decomposed safe-space ablation to mitigate over-refusal. Extensive experiments demonstrate that SARSteer significantly improves harmful-query refusal while preserving benign responses, establishing a principled step toward safety alignment in LALMs.
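To make the core mechanism concrete, the sketch below illustrates one plausible reading of the abstract: a text-derived refusal direction is added to a hidden state at inference time, after its components lying in a "safe" subspace (e.g., spanned by benign-query activations) have been projected out so that benign speech queries are left largely untouched. The function name, tensor shapes, choice of steering layer, and scaling factor are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def steer_with_safe_ablation(hidden: torch.Tensor,
                             refusal_dir: torch.Tensor,
                             safe_basis: torch.Tensor,
                             alpha: float = 8.0) -> torch.Tensor:
    """Minimal sketch of refusal steering with safe-space ablation.

    hidden:      (d,) hidden state at the chosen transformer layer.
    refusal_dir: (d,) text-derived refusal direction (e.g., a
                 difference-of-means over harmful vs. harmless text
                 activations -- an assumption, not the paper's recipe).
    safe_basis:  (d, k) orthonormal basis of the safe subspace
                 (e.g., top principal directions of benign activations).
    alpha:       illustrative steering strength.
    """
    # Normalize the refusal direction.
    r = refusal_dir / refusal_dir.norm()

    # Ablate the safe subspace from the steering direction:
    # r_ablated = r - B (B^T r), so steering has no component
    # along directions used by benign queries.
    r_ablated = r - safe_basis @ (safe_basis.T @ r)

    # Steer the hidden state toward refusal along the ablated direction.
    return hidden + alpha * r_ablated
```

In this reading, the ablation step is what trades off the two failure modes named above: steering alone would push benign speech queries toward refusal, while removing the safe-subspace component confines the intervention to directions associated with harmful content.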