Recent advances in long-context language models (LMs) have enabled million-token inputs, expanding their capabilities across complex tasks such as computer-use agents. Yet the safety implications of these extended contexts remain unclear. To bridge this gap, we introduce NINJA (short for Needle-in-haystack jailbreak attack), a method that jailbreaks aligned LMs by appending benign, model-generated content to harmful user goals. Critical to our method is the observation that the position of the harmful goal plays an important role in safety. Experiments on the standard safety benchmark HarmBench show that NINJA significantly increases attack success rates across state-of-the-art open and proprietary models, including LLaMA, Qwen, Mistral, and Gemini. Unlike prior jailbreaking methods, our approach is low-resource, transferable, and less detectable. Moreover, we show that NINJA is compute-optimal -- under a fixed compute budget, increasing context length can outperform increasing the number of trials in best-of-N jailbreaking. These findings reveal that even benign long contexts -- when crafted with careful goal positioning -- introduce fundamental vulnerabilities in modern LMs.
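To make the construction concrete, the following is a minimal sketch (not the authors' implementation) of the prompt-building idea described above: a user goal (the "needle") is embedded at a controllable position within benign filler content (the "haystack"). The function name `build_ninja_prompt` and the static placeholder filler are hypothetical; in the paper the filler is benign, model-generated content.

```python
# Minimal sketch of needle-in-haystack prompt construction.
# All names and the placeholder filler below are illustrative assumptions,
# not the paper's actual pipeline.

def build_ninja_prompt(goal: str, filler_paragraphs: list[str], goal_position: float) -> str:
    """Place `goal` at a relative position (0.0 = start, 1.0 = end)
    inside benign filler text, producing one long prompt string."""
    assert 0.0 <= goal_position <= 1.0
    split_index = round(goal_position * len(filler_paragraphs))
    before = filler_paragraphs[:split_index]
    after = filler_paragraphs[split_index:]
    return "\n\n".join(before + [goal] + after)


if __name__ == "__main__":
    # Static placeholder filler so the sketch runs without model access;
    # the method itself uses benign, model-generated content.
    filler = [f"Benign paragraph {i} about an unrelated, harmless topic." for i in range(100)]
    goal = "<user goal under safety evaluation>"

    # Sweep the goal position to study how placement affects model behavior.
    for position in (0.0, 0.5, 1.0):
        prompt = build_ninja_prompt(goal, filler, position)
        print(position, len(prompt), "characters")
```

Sweeping `goal_position` in this way is one simple means of probing the positional effect the abstract highlights; context length can be scaled by adding filler paragraphs under a fixed compute budget.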