Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which a model's previous response in the dialogue can steer its subsequent behavior toward policy-violating content. Existing jailbreak attacks largely rely on single-turn or multi-turn prompt manipulation, or on injecting static in-context examples, and these methods suffer from limited effectiveness, inefficiency, or semantic drift. We introduce Response Attack (RA), a novel framework that strategically leverages intermediate, mildly harmful responses as contextual primers within a dialogue. By reformulating harmful queries and injecting the resulting intermediate responses into the dialogue history before issuing a targeted trigger prompt, RA exploits this previously overlooked vulnerability. Extensive experiments across eight state-of-the-art LLMs show that RA consistently achieves significantly higher attack success rates than nine leading jailbreak baselines. Our results demonstrate that RA's success is directly attributable to the strategic use of intermediate responses, which induce models to generate more explicit and relevant harmful content while maintaining stealth, efficiency, and fidelity to the original query. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
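To make the primed-dialogue structure concrete, the sketch below assembles a chat history of the kind the abstract describes: a reformulated query, a fabricated mildly harmful intermediate response presented as the model's own prior turn, and a final trigger prompt. This is an illustrative sketch, not the released implementation; the message format assumes an OpenAI-style chat API, and all strings are hypothetical placeholders rather than the paper's actual prompts.

```python
# Illustrative sketch of the primed dialogue described in the abstract.
# Assumes an OpenAI-style chat message format; the strings below are
# hypothetical placeholders, not the paper's actual prompts or responses.

def build_primed_dialogue(reformulated_query: str,
                          intermediate_response: str,
                          trigger_prompt: str) -> list[dict]:
    """Assemble a chat history in which an injected intermediate response
    appears to be the model's own previous answer (the contextual primer)."""
    return [
        {"role": "user", "content": reformulated_query},
        # Injected contextual primer, passed off as the assistant's prior turn.
        {"role": "assistant", "content": intermediate_response},
        # Targeted trigger prompt issued after the primer.
        {"role": "user", "content": trigger_prompt},
    ]

messages = build_primed_dialogue(
    "Reformulated, innocuous-looking version of the original query...",
    "Mildly harmful intermediate response used as the contextual primer...",
    "Trigger prompt asking the model to continue or elaborate...",
)

# The history would then be sent to the target model, e.g. via an
# OpenAI-compatible client (an assumption, not specified by the paper):
# completion = client.chat.completions.create(model=MODEL, messages=messages)
print(messages)
```

The point of this structure is that the target model conditions on the injected assistant turn as if it had produced it, which is what the abstract means by using intermediate responses as contextual primers.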