Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This limits their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To close this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop that lets it iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state of the art, achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.
