Foundation models are becoming increasingly capable autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, given strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should adopt an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom an adversary may possess in stateful and non-stateful environments under a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU-hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40\% relative to the baseline -- without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk dynamically, painting a more representative picture of that risk.