BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit Choudhary, Siddharth M. Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Daniel E. Ho, Percy Liang

From arXiv, 113 pages

AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a given vulnerability), and Patch (patching a given vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards from \$10 to \$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a given vulnerability. We evaluate 10 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, Qwen3 235B A22B, Llama 4 Maverick, and DeepSeek-R1. Given up to three attempts, the top-performing agents are Codex CLI: o3-high (12.5% on Detect, mapping to \$3,720; 90% on Patch, mapping to \$14,152), Custom Agent: Claude 3.7 Sonnet Thinking (67.5% on Exploit), and Codex CLI: o4-mini (90% on Patch, mapping to \$14,422). Codex CLI: o3-high, Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 17.5-67.5% and Patch scores of 25-60%.
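The success rates quoted in the abstract are all multiples of 2.5%, which is consistent with each task type being scored over the 40 bounties. As a rough sketch under that assumption (the abstract does not state the denominator explicitly, and the helper name `tasks_solved` is hypothetical), the percentages can be converted back into task counts:

```python
from fractions import Fraction

# Number of bug bounties stated in the abstract. Using it as the denominator
# for every task type is an assumption, not something the abstract confirms.
TOTAL_BOUNTIES = 40

# Success rates (percent) quoted in the abstract, kept as strings so they
# convert to exact fractions rather than floats.
reported = {
    ("Codex CLI: o3-high", "Detect"): "12.5",
    ("Codex CLI: o3-high", "Patch"): "90",
    ("Codex CLI: o4-mini", "Patch"): "90",
    ("Claude Code", "Patch"): "87.5",
    ("Custom Agent: Claude 3.7 Sonnet Thinking", "Exploit"): "67.5",
}


def tasks_solved(percent: str, total: int = TOTAL_BOUNTIES) -> int:
    """Convert a percentage string into a whole number of solved tasks."""
    count = Fraction(percent) / 100 * total
    if count.denominator != 1:
        raise ValueError(f"{percent}% of {total} is not a whole number of tasks")
    return int(count)


if __name__ == "__main__":
    for (agent, task), percent in reported.items():
        print(f"{agent} / {task}: {percent}% -> {tasks_solved(percent)} of {TOTAL_BOUNTIES}")
```

Under this reading, 12.5% on Detect corresponds to 5 bounties, 67.5% on Exploit to 27, and 90% on Patch to 36. The dollar figures cannot be recovered this way, since they depend on which specific bounties were solved.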
