The detection of software vulnerabilities (or vulnerabilities for short) is an important problem that remains far from solved, as shown by the many vulnerabilities reported on a daily basis. This calls for machine learning methods for vulnerability detection. Deep learning is attractive for this purpose because it alleviates the requirement to manually define features. Despite the tremendous success of deep learning in other application domains, its applicability to vulnerability detection is not systematically understood. To fill this void, we propose the first systematic framework for using deep learning to detect vulnerabilities in the source code of C/C++ programs. The framework, dubbed Syntax-based, Semantics-based, and Vector Representations (SySeVR), focuses on obtaining program representations that can accommodate the syntax and semantic information pertinent to vulnerabilities. Our experiments with four software products demonstrate the usefulness of the framework: we detect 15 vulnerabilities that are not reported in the National Vulnerability Database. Among these 15 vulnerabilities, 7 were previously unknown and have been reported to the vendors, and the other 8 had been "silently" patched by the vendors when releasing newer versions of the pertinent software products.