As LLMs see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work has demonstrated that security is often overlooked, exposing that LLMs are prone to generating code with security vulnerabilities. These insights were enabled by specialized benchmarks, crafted through significant manual effort by security experts. However, relying on manually crafted benchmarks is insufficient in the long term, because benchmarks (i) naturally end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AutoBaxBuilder, a framework that generates tasks and tests for code security benchmarking from scratch. We introduce a robust pipeline with fine-grained plausibility checks, leveraging the code understanding capabilities of LLMs to construct functionality tests and end-to-end security-probing exploits. To confirm the quality of the generated benchmark, we conduct both a qualitative analysis and quantitative experiments, comparing it against tasks constructed by human experts. We use AutoBaxBuilder to construct entirely new tasks and release them to the public as AutoBaxBench, together with a thorough evaluation of the security capabilities of LLMs on these tasks. We find that a new task can be generated in under 2 hours, costing less than USD 10.
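The abstract does not specify the concrete form of the generated artifacts. As a rough illustration only, a generated task might pair functionality tests with an end-to-end security-probing exploit along the lines of the minimal Python sketch below. The server URL, the /files endpoint, and both test functions are assumptions made for this sketch; they are not part of AutoBaxBuilder or AutoBaxBench.

    """Illustrative sketch only: a hypothetical shape for a generated task's
    functionality test and end-to-end security exploit. All names, the
    endpoint, and the server address are assumptions, not the paper's API."""
    import urllib.error
    import urllib.parse
    import urllib.request

    BASE_URL = "http://localhost:8080"  # hypothetical server under test


    def func_test_store_and_fetch() -> bool:
        """Functionality test: a stored file should be retrievable verbatim."""
        body = b"hello world"
        req = urllib.request.Request(
            f"{BASE_URL}/files/notes.txt", data=body, method="PUT"
        )
        urllib.request.urlopen(req)
        with urllib.request.urlopen(f"{BASE_URL}/files/notes.txt") as resp:
            return resp.read() == body


    def security_exploit_path_traversal() -> bool:
        """End-to-end security probe: a traversal payload must not escape the
        storage directory. Returns True if the exploit succeeds, i.e. the
        tested implementation is vulnerable."""
        payload = urllib.parse.quote("../../etc/passwd", safe="")
        try:
            with urllib.request.urlopen(f"{BASE_URL}/files/{payload}") as resp:
                return b"root:" in resp.read()
        except urllib.error.HTTPError:
            return False  # request rejected: exploit did not succeed


    if __name__ == "__main__":
        print("functional:", func_test_store_and_fetch())
        print("vulnerable:", security_exploit_path_traversal())

In this sketch, an LLM-generated solution passes the task only if the functionality test succeeds and the exploit fails, mirroring the abstract's split between functionality tests and security-probing exploits.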


