Large language models (LLMs) have shown impressive capability in understanding and developing code. However, their ability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to elicit the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of the system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a separate set of system proof tasks that were excluded from VeruSAGE-Bench because human experts had not yet finished them. These results demonstrate the great potential of LLM-assisted development of verified system software.