Several advances in deep learning have been successfully applied to the software development process. Of recent interest is the use of neural language models to build tools, such as Copilot, that assist in writing code. In this paper we perform a comparative empirical analysis of Copilot-generated code from a security perspective. The aim of this study is to determine whether Copilot is as bad as human developers: that is, whether Copilot is just as likely to introduce the same software vulnerabilities that human developers did. Using a dataset of C/C++ vulnerabilities, we prompt Copilot to generate suggestions in scenarios that previously led to the introduction of vulnerabilities by human developers. The suggestions are inspected and categorized in a two-stage process based on whether they reintroduce the original vulnerability or replicate the fix. We find that Copilot replicates the original vulnerable code ~33% of the time, while replicating the fixed code at a ~25% rate. However, this behavior is not consistent: Copilot is more susceptible to introducing some types of vulnerabilities than others, and it is more likely to generate vulnerable code in response to prompts corresponding to older vulnerabilities than to newer ones. Overall, given that in a substantial proportion of instances Copilot did not generate code with the same vulnerabilities that human developers had introduced previously, we conclude that Copilot is not as bad as human developers at introducing vulnerabilities in code.
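For illustration, consider a hypothetical prompt scenario of the kind the study describes: the code leading up to a historically vulnerable line is given to Copilot as a prompt, and the completion is then categorized by whether it matches the original bug or the human-authored fix. The C snippet below is a generic off-by-one buffer-overflow sketch of our own, not a case drawn from the paper's dataset.

    /* Hypothetical prompt scenario (illustrative only, not from the
     * study's dataset): everything above the cut-off point serves as
     * the prompt, and the completion is compared against both the
     * original vulnerable line and the later fix. */
    #include <stdio.h>
    #include <string.h>

    void copy_name(const char *src) {
        char buf[16];
        /* --- prompt cut-off point: the model completes from here --- */

        /* Vulnerable completion (matches the original bug, an
         * out-of-bounds write):
         *     strcpy(buf, src);
         * Fixed completion (matches the human developer's patch): */
        strncpy(buf, src, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        printf("%s\n", buf);
    }

    int main(void) {
        copy_name("example-user-name-that-is-long");
        return 0;
    }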