A ground-truth dataset of real security patches

Training machine learning approaches for vulnerability identification, and producing reliable tools that help developers write quality software free of vulnerabilities, is challenging because large datasets of real data are lacking. Researchers have been addressing this by building datasets. However, these datasets usually lack natural language artifacts and programming language diversity. We scraped the entire CVE Details database for GitHub references and augmented the data with 3 security-related datasets. We used the data to create a ground-truth dataset of natural language artifacts (such as commit messages, commit comments, and summaries), metadata, and code changes. Our dataset integrates a total of 8057 security-relevant commits (equivalent to 5942 security patches) from 1339 different projects, spanning 146 different types of vulnerabilities and 20 languages. A dataset of 110k non-security-related commits is also provided. Data and scripts are all available on GitHub. The data is stored in a CSV file, and the codebases can be downloaded using our scripts. Our dataset is a valuable asset for answering research questions on topics such as the identification of security-relevant information using NLP models; software engineering and security best practices; vulnerability detection and patching; and security program analysis.
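As an illustration of the "GitHub references" step mentioned in the abstract, the following is a minimal sketch of how commit links might be pulled out of free-text reference fields. It is not the authors' script: the regular expression, the function name `extract_commit_refs`, and the sample URL are all assumptions made for this example.

```python
import re

# Hypothetical pattern (an assumption, not taken from the paper's scripts):
# matches URLs of the form https://github.com/<owner>/<repo>/commit/<sha>
COMMIT_URL = re.compile(
    r"https?://github\.com/([\w.-]+)/([\w.-]+)/commit/([0-9a-fA-F]{7,40})"
)


def extract_commit_refs(text):
    """Return unique (owner, repo, sha) tuples found in text, in first-seen order."""
    seen = set()
    refs = []
    for owner, repo, sha in COMMIT_URL.findall(text):
        # Compare case-insensitively so the same commit is not counted twice.
        key = (owner.lower(), repo.lower(), sha.lower())
        if key not in seen:
            seen.add(key)
            refs.append((owner, repo, sha))
    return refs


if __name__ == "__main__":
    # Made-up reference text; the repository and hash are placeholders.
    sample = (
        "Patch: https://github.com/example/project/commit/0123abc4 "
        "(see also https://github.com/example/project/commit/0123abc4)"
    )
    print(extract_commit_refs(sample))
```

Deduplicating on the (owner, repo, sha) triple reflects the fact that several references can point at the same commit; how commits are then grouped into patches is left out here because the abstract does not describe it.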

