Large Language Models (LLMs) such as Codex are powerful tools for code completion and code generation, as they are trained on billions of lines of code from publicly available sources. Moreover, these models can generate code snippets from Natural Language (NL) descriptions, having learned programming languages and coding practices from public GitHub repositories. Although LLMs promise effortless NL-driven deployment of software applications, the security of the code they generate has not been extensively investigated or documented. In this work, we present LLMSecEval, a dataset of 150 NL prompts that can be leveraged to assess the security performance of such models. These prompts are NL descriptions of code snippets prone to various security vulnerabilities listed in MITRE's Top 25 Common Weakness Enumeration (CWE) ranking. Each prompt in our dataset comes with a secure implementation example to facilitate comparative evaluations against code produced by LLMs. As a practical application, we show how LLMSecEval can be used to evaluate the security of snippets automatically generated from NL descriptions.
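To make the dataset concept concrete, the sketch below pairs a hypothetical NL prompt with a secure reference implementation for CWE-89 (SQL injection), one of the weaknesses in MITRE's Top 25. The prompt wording, the choice of CWE, and the code are illustrative assumptions, not entries taken verbatim from LLMSecEval.

    # Hypothetical LLMSecEval-style pairing (illustrative; not verbatim
    # dataset content). NL prompt: "Write a function that returns the row
    # for a given username from the 'users' table." The prompt targets
    # CWE-89 (SQL injection); the secure reference below uses a
    # parameterized query so user input cannot alter the SQL statement.
    import sqlite3

    def get_user(conn: sqlite3.Connection, username: str):
        # The '?' placeholder binds the value safely instead of
        # concatenating it into the query string.
        cur = conn.execute("SELECT * FROM users WHERE name = ?", (username,))
        return cur.fetchone()

    # Minimal usage check against an in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")
    print(get_user(conn, "alice"))  # -> ('alice', 'alice@example.com')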