Large language models (LLMs) have shown impressive capability in understanding and developing code. However, their ability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to elicit the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of the system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a separate set of system proof tasks that were excluded from VeruSAGE-Bench because human experts had not yet finished them. These results demonstrate the great potential of LLM-assisted development of verified system software.