Static analysis tools are widely used for vulnerability detection because they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets for training vulnerability identification models suffer from multiple limitations, such as limited bug context, limited size, and synthetic, unrealistic source code. We propose D2A, a differential-analysis-based approach to label issues reported by static analysis tools. The D2A dataset is built by analyzing version pairs from multiple open-source projects. From each project, we select bug-fixing commits and run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely real bugs that the commit fixed. We use D2A to generate a large labeled dataset for training vulnerability identification models. We show that the dataset can be used to build a classifier that identifies likely false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first.
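The core labeling heuristic described above can be sketched in a few lines: collect the issues reported before and after a bug-fixing commit, and label each before-commit issue by whether it survives the fix. This is a minimal illustration only; the issue key and function names are hypothetical, and the actual D2A pipeline involves more careful issue matching across versions.

```python
def label_issues(before_issues, after_issues):
    """Differential labeling sketch: issues that disappear after a
    bug-fixing commit are likely true positives; issues that persist
    are likely false positives.

    Each issue is a hashable key (here, an illustrative tuple of
    (bug_type, file, procedure)) so reports can be matched across
    the before-commit and after-commit versions.
    """
    after = set(after_issues)
    labels = {}
    for issue in before_issues:
        if issue not in after:
            # Disappeared after the fix: likely a real bug the commit fixed.
            labels[issue] = "likely_true_positive"
        else:
            # Survived the fix: likely a false alarm.
            labels[issue] = "likely_false_positive"
    return labels


# Toy example with made-up static analyzer reports.
before = [("BUFFER_OVERRUN", "src/parse.c", "read_header"),
          ("NULL_DEREFERENCE", "src/util.c", "copy_name")]
after = [("NULL_DEREFERENCE", "src/util.c", "copy_name")]

print(label_issues(before, after))
```

Here the buffer-overrun report vanishes in the after-commit version and is labeled a likely true positive, while the null-dereference report persists and is labeled a likely false positive.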