Training machine learning models for vulnerability identification, and building reliable tools that help developers ship quality software free of vulnerabilities, is challenging because large datasets of real-world data are scarce. Researchers have been addressing this problem by building datasets. However, these datasets usually lack natural language artifacts and programming language diversity. We scraped the entire CVE Details database for GitHub references and augmented the data with 3 security-related datasets. We used the data to create a ground-truth dataset of natural language artifacts (such as commit messages, commit comments, and summaries), metadata, and code changes. Our dataset integrates a total of 8057 security-relevant commits, equivalent to 5942 security patches, from 1339 different projects spanning 146 different types of vulnerabilities and 20 languages. A dataset of 110k non-security-related commits is also provided. Data and scripts are all available on GitHub: the data is stored in a .CSV file, and the codebases can be downloaded using our scripts. Our dataset is a valuable asset for answering research questions on topics such as the identification of security-relevant information using NLP models; software engineering and security best practices; vulnerability detection and patching; and security program analysis.
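As an illustration of the reference-scraping step, the following is a minimal sketch of how GitHub commit references might be picked out of a list of CVE reference URLs. It is not the authors' script: the names `parse_commit_ref` and `unique_commit_refs` are hypothetical, and the code assumes that references follow the common `https://github.com/<owner>/<repo>/commit/<sha>` URL shape.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumption: commit references look like
#   https://github.com/<owner>/<repo>/commit/<sha>
# with an abbreviated or full hexadecimal SHA (7 to 40 characters).
COMMIT_URL = re.compile(
    r"https?://github\.com/(?P<owner>[^/\s]+)/(?P<repo>[^/\s]+)"
    r"/commit/(?P<sha>[0-9a-fA-F]{7,40})"
)


class CommitRef(NamedTuple):
    """Hypothetical record for one GitHub commit reference."""
    owner: str
    repo: str
    sha: str


def parse_commit_ref(url: str) -> Optional[CommitRef]:
    """Return the commit reference found in `url`, or None if there is none."""
    match = COMMIT_URL.search(url)
    if match is None:
        return None
    # Lowercase the SHA so differently cased copies compare equal.
    return CommitRef(match["owner"], match["repo"], match["sha"].lower())


def unique_commit_refs(urls: Iterable[str]) -> List[CommitRef]:
    """Collect commit references from `urls`, dropping duplicates in order."""
    seen = set()
    refs = []
    for url in urls:
        ref = parse_commit_ref(url)
        if ref is not None and ref not in seen:
            seen.add(ref)
            refs.append(ref)
    return refs
```

Several references can point to the same commit, so deduplicating them is one plausible preprocessing step before grouping commits into patches; the actual pipeline may handle this differently.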