Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix ``Sure, here is (harmful request)''. While straightforward, this objective has two limitations: limited control over model behavior, which often yields incomplete or unrealistic jailbroken responses, and a rigid format that hinders optimization. We introduce AdvPrefix, a plug-and-play prefix-forcing objective that selects one or more model-dependent prefixes by combining two criteria: high prefilling attack success rates and low negative log-likelihood. AdvPrefix integrates seamlessly into existing jailbreak attacks and mitigates both limitations at no additional cost. For example, replacing GCG's default target prefixes with ours on Llama-3 improves nuanced attack success rates from 14% to 80%, suggesting that current safety alignment fails to generalize to unseen prefixes. Code and selected prefixes are released at github.com/facebookresearch/jailbreak-objectives.
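The abstract describes prefix selection via two criteria: a high prefilling attack success rate and a low negative log-likelihood (NLL) under the target model. The sketch below illustrates one way such a selection could be scored; it is not the authors' released code, and the model name, thresholds, and the assumed availability of per-prefix prefilling ASR measurements are illustrative assumptions.

```python
# Minimal sketch of AdvPrefix-style prefix selection (illustrative, not the released code).
# Assumes candidate prefixes and their measured prefilling attack success rates (ASR)
# are given; each prefix is scored by its average NLL under the target model, and
# prefixes passing both criteria are kept.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed target model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()


def prefix_nll(request: str, prefix: str) -> float:
    """Average per-token NLL of `prefix` given the chat-formatted `request`."""
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": request}],
        tokenize=False,
        add_generation_prompt=True,
    )
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prefix_ids = tokenizer(prefix, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, prefix_ids], dim=1)
    # Only the prefix tokens contribute to the loss; prompt tokens are masked out.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over prefix tokens
    return loss.item()


def select_prefixes(request, candidates, prefill_asr, min_asr=0.5, max_nll=2.0):
    """Keep candidates with high prefilling ASR and low NLL (thresholds are hypothetical)."""
    return [
        p for p in candidates
        if prefill_asr[p] >= min_asr and prefix_nll(request, p) <= max_nll
    ]
```

A selected prefix (or set of prefixes) would then replace the attack's default ``Sure, here is ...'' target when running an optimizer such as GCG.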