Mixture-of-Experts (MoE) architectures have advanced the scaling of Large Language Models (LLMs) by activating only a sparse subset of parameters per input, enabling state-of-the-art performance at reduced computational cost. As these models are increasingly deployed in critical domains, understanding and strengthening their alignment mechanisms is essential to prevent harmful outputs. However, existing LLM safety research has focused almost exclusively on dense architectures, leaving the unique safety properties of MoEs largely unexamined. The modular, sparsely activated design of MoEs suggests that safety mechanisms may operate differently than in dense models, raising questions about their robustness. In this paper, we present GateBreaker, the first training-free, lightweight, and architecture-agnostic attack framework that compromises the safety alignment of modern MoE LLMs at inference time. GateBreaker operates in three stages: (i) gate-level profiling, which identifies safety experts to which harmful inputs are disproportionately routed; (ii) expert-level localization, which pinpoints the safety structure within those experts; and (iii) targeted safety removal, which disables the identified structure to break safety alignment. Our study shows that MoE safety concentrates in a small subset of neurons coordinated by sparse routing. Selectively disabling these neurons, approximately 3% of the neurons in the targeted expert layers, raises the average attack success rate (ASR) from 7.4% to 64.9% against eight of the latest aligned MoE LLMs with limited utility degradation. These safety neurons transfer across models within the same family, raising ASR from 17.9% to 67.7% in a one-shot transfer attack. Furthermore, GateBreaker generalizes to five MoE vision-language models (VLMs), achieving 60.9% ASR on unsafe image inputs.
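To make the three-stage pipeline described above concrete, the following is a minimal, self-contained PyTorch sketch of the idea: profile expert routing and per-neuron activations on harmful versus benign inputs, select the experts that harmful inputs are disproportionately routed to, and zero out roughly 3% of the most harmful-responsive hidden neurons inside them. The toy `ToyMoELayer`, the top-1 routing, and the activation-difference scoring rule are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of the GateBreaker idea on a toy MoE layer.
# All module names, shapes, and scoring heuristics here are assumptions.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """Top-1 routed MoE feed-forward layer (toy stand-in for a real MoE block)."""

    def __init__(self, d_model=32, d_ff=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        # Mask over each expert's hidden neurons; setting an entry to 0 disables
        # that neuron (stage iii: targeted safety removal).
        self.neuron_mask = nn.Parameter(torch.ones(n_experts, d_ff), requires_grad=False)

    def forward(self, x, record=None):
        expert_idx = self.gate(x).argmax(dim=-1)          # top-1 routing per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = expert_idx == e
            if not sel.any():
                continue
            h = expert[1](expert[0](x[sel]))               # hidden activations (tokens_e, d_ff)
            if record is not None:                         # stages (i)/(ii): profiling
                record["route_counts"][e] += int(sel.sum())
                record["act_sums"][e] += h.abs().sum(dim=0)
            h = h * self.neuron_mask[e]                    # stage (iii): disable selected neurons
            out[sel] = expert[2](h)
        return out


def profile(layer, inputs):
    """Record routing frequency and per-neuron activation mass for each expert."""
    record = {"route_counts": torch.zeros(len(layer.experts)),
              "act_sums": torch.zeros_like(layer.neuron_mask)}
    with torch.no_grad():
        layer(inputs, record=record)
    return record


torch.manual_seed(0)
layer = ToyMoELayer()
harmful = torch.randn(128, 32)   # stand-ins for hidden states on harmful prompts
benign = torch.randn(128, 32)    # stand-ins for hidden states on benign prompts

rec_h, rec_b = profile(layer, harmful), profile(layer, benign)

# Stage (i): experts routed disproportionately often on harmful inputs.
ratio = (rec_h["route_counts"] + 1) / (rec_b["route_counts"] + 1)
safety_experts = torch.topk(ratio, k=1).indices

# Stages (ii)+(iii): within those experts, disable ~3% of neurons that respond
# most strongly to harmful relative to benign inputs.
for e in safety_experts.tolist():
    score = rec_h["act_sums"][e] - rec_b["act_sums"][e]
    k = max(1, int(0.03 * score.numel()))
    layer.neuron_mask[e, torch.topk(score, k).indices] = 0.0

# Subsequent forward passes now run with the identified safety neurons ablated.
_ = layer(harmful)
```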

