Security patches in open-source software, which fix identified vulnerabilities, are crucial in protecting against cyberattacks. Although the National Vulnerability Database (NVD) publishes identified vulnerabilities, the vast majority of vulnerabilities and their corresponding security patches are never publicly disclosed, e.g., those in the open-source libraries that developers rely on heavily. An extensive security patch dataset could help end users such as security companies (e.g., in building a security knowledge base) and researchers (e.g., in vulnerability research). To curate security patches, including undisclosed ones, at large scale and low cost, we propose a deep neural-network-based approach built upon the commits of open-source repositories. We build security patch datasets that include 38,291 security-related commits and 1,045 CVE patches from four C libraries. We manually verify each of the 38,291 commits to determine whether it is indeed security-related. We devise a deep learning-based security patch identification system that consists of two neural networks: a commit-message neural network that uses pretrained word representations learned from our commit dataset, and a code-revision neural network that takes the code before and after a revision and learns the distinction at the statement level. Our evaluation results show that our system outperforms SVM and the K-fold stacking algorithm, achieving an F1-score as high as 87.93% and a precision of 86.24%. We deployed our pipeline and learned model in an industrial production environment to evaluate the generalization ability of our approach. The industrial dataset consists of 298,917 commits from 410 new libraries that span a wide range of functionality. Our experimental results and observations show that our approach identifies security patches effectively among open-source projects.
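
To make the input of the code-revision network more concrete, the following is a minimal sketch of how a commit's unified diff might be split into "before" and "after" statement lists. It is an illustration only, not the preprocessing used in this work: the function name `split_revision`, the sample diff, and the choice to treat each non-empty diff line as one statement are all assumptions made for this example.

```python
from typing import List, Tuple


def split_revision(diff_text: str) -> Tuple[List[str], List[str]]:
    """Split a git-style unified diff into (before, after) statement lists.

    Assumption: each non-empty diff line is treated as one "statement".
    A real pipeline would likely use a C parser instead.
    """
    before: List[str] = []
    after: List[str] = []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff "):
            # A new file section begins; its header lines are not code.
            in_hunk = False
            continue
        if line.startswith("@@"):
            # Hunk header: the lines that follow are code lines.
            in_hunk = True
            continue
        if not in_hunk:
            continue  # skip "---", "+++", "index" and similar headers
        tag, body = line[:1], line[1:].strip()
        if not body:
            continue  # ignore blank lines
        if tag == "-":
            before.append(body)      # present only before the revision
        elif tag == "+":
            after.append(body)       # present only after the revision
        elif tag == " ":
            before.append(body)      # unchanged context appears in both
            after.append(body)
        # Any other tag (e.g. "\ No newline at end of file") is ignored.
    return before, after


# Hypothetical patch that adds a bounds check before a memcpy call.
SAMPLE_DIFF = """\
diff --git a/buf.c b/buf.c
--- a/buf.c
+++ b/buf.c
@@ -10,4 +10,6 @@ void copy(char *dst, const char *src, size_t n)
 {
-    memcpy(dst, src, n);
+    if (n > DST_MAX)
+        return;
+    memcpy(dst, src, n);
 }
"""

if __name__ == "__main__":
    old, new = split_revision(SAMPLE_DIFF)
    print("before:", old)
    print("after: ", new)
```

The sketch relies on a `diff --git` line to mark file boundaries, so plain `diff -u` output that contains several files would need extra handling.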