Detecting security vulnerabilities in software before they are exploited has been a challenging problem for decades. Traditional code analysis methods have been proposed, but they are often ineffective and inefficient. In this work, we model software vulnerability detection as a natural language processing (NLP) problem, treating source code as text, and address automated software vulnerability detection with recent advanced deep learning NLP models assisted by transfer learning on written English. For training and testing, we preprocessed the NIST NVD/SARD databases and built a dataset of over 100,000 files in the C programming language covering 123 types of vulnerabilities. In our extensive experiments, the best performance exceeds 93\% accuracy in detecting security vulnerabilities.