As jailbreaking attacks on large language models (LLMs) grow in scale and complexity, their efficiency and practical applicability become increasingly constrained, yet they continue to pose a serious challenge to LLM security. Jailbreaking techniques have evolved from manual prompt engineering to automated methods. Recent work automates jailbreaking by harnessing LLMs to generate jailbreak instructions and adversarial examples, with encouraging results. However, these methods all rely on an LLM generation phase, and the cost of deploying and running inference with LLMs hinders practical implementation and wider adoption. To address this issue, we introduce \textbf{Adversarial Prompt Distillation}, a framework that combines masked language modeling, reinforcement learning, and dynamic temperature control to distill the jailbreaking capability of LLMs into small language models (SLMs). The resulting attacks are efficient and robust, maintain high success rates, and apply to a broader range of settings. Empirical evaluations confirm the approach's advantages in attack effectiveness, resource efficiency, and cross-model transferability. Our findings demonstrate the feasibility of transferring jailbreak capabilities to SLMs, expose inherent vulnerabilities of LLMs, and offer new insights for LLM security research. Our code is available at: https://github.com/lxgem/Efficient_and_Stealthy_Jailbreak_Attacks_via_Adversarial_Prompt.