We propose and release a new dataset of vulnerable source code. We curate the dataset by crawling security-issue websites and extracting vulnerability-fixing commits, along with the source code of the corresponding projects. Our new dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable functions extracted from 7,861 commits, and it covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rates, low F1 scores, and difficulty in detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. However, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future of vulnerability detection, outperforming graph neural networks (GNNs) with manual feature engineering. Moreover, developing source-code-specific pre-training objectives is a promising research direction for improving vulnerability detection performance.
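The curation step above relies on a standard commit-based labeling rule: in a vulnerability-fixing commit, the pre-fix versions of the functions the fix modifies are labeled vulnerable, while the remaining functions in the touched files are labeled non-vulnerable. The following is a minimal sketch of that rule; the function names and data layout are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch of commit-based labeling for a vulnerability-fixing commit.
# Assumption: we already know which function names the fix diff modified.

def label_functions(pre_fix_functions, changed_names):
    """Label each function in the pre-fix snapshot of a file.

    Functions changed by the fix are 'vulnerable' (their pre-fix bodies
    contain the flaw); all other functions are 'non-vulnerable'.
    """
    return {
        name: "vulnerable" if name in changed_names else "non-vulnerable"
        for name in pre_fix_functions
    }

# Example: a hypothetical fix commit that only touches `parse_header`.
pre_fix = ["parse_header", "read_body", "close_conn"]
changed = {"parse_header"}
print(label_functions(pre_fix, changed))
```

Applied over all vulnerability-fixing commits of a project, this rule yields the highly imbalanced vulnerable/non-vulnerable function counts reported above.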