We propose and release a new dataset of vulnerable source code. We curate the dataset by crawling security-issue websites and extracting vulnerability-fixing commits, along with the source code of the corresponding projects. Our new dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable functions extracted from 7,861 commits, and it covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rates, low F1 scores, and difficulty in detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. However, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future of vulnerability detection, outperforming graph neural networks (GNNs) with manual feature engineering. Moreover, developing source-code-specific pre-training objectives is a promising research direction for improving vulnerability detection performance.
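The curation step above relies on a standard commit-based labeling rule: in a vulnerability-fixing commit, the pre-fix versions of the functions the fix modifies are labeled vulnerable, while the remaining functions in the touched files are labeled non-vulnerable. The following is a minimal sketch of that rule; the function names and data layout are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch of commit-based labeling for a vulnerability-fixing commit.
# Assumption: we already know which function names the fix diff modified.

def label_functions(pre_fix_functions, changed_names):
    """Label each function in the pre-fix snapshot of a file.

    Functions changed by the fix are 'vulnerable' (their pre-fix bodies
    contain the flaw); all other functions are 'non-vulnerable'.
    """
    return {
        name: "vulnerable" if name in changed_names else "non-vulnerable"
        for name in pre_fix_functions
    }

# Example: a hypothetical fix commit that only touches `parse_header`.
pre_fix = ["parse_header", "read_body", "close_conn"]
changed = {"parse_header"}
print(label_functions(pre_fix, changed))
```

Applied over all vulnerability-fixing commits of a project, this rule yields the highly imbalanced vulnerable/non-vulnerable function counts reported above.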