We introduce a new benchmark for assessing the capabilities and risks of AI models in automated software exploitation, focusing on their ability to detect and exploit vulnerabilities in real-world software systems. Using DARPA's AI Cyber Challenge (AIxCC) framework and its Nginx challenge project, a deliberately modified version of the widely used Nginx web server, we evaluate several leading language models, including OpenAI's o1-preview and o1-mini, Anthropic's Claude-3.5-sonnet-20241022 and Claude-3.5-sonnet-20240620, Google DeepMind's Gemini-1.5-pro, and OpenAI's earlier GPT-4o. Our findings show that these models differ significantly in both success rate and efficiency: o1-preview achieves the highest success rate at 64.71 percent, while o1-mini and Claude-3.5-sonnet-20241022 offer cost-effective but less successful alternatives. This benchmark establishes a foundation for systematically evaluating the AI cyber risk posed by automated exploitation tools.