Static analysis tools are widely used for vulnerability detection because they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets for training vulnerability identification models suffer from multiple limitations, such as limited bug context, limited size, and synthetic, unrealistic source code. We propose D2A, a differential-analysis-based approach to label issues reported by static analysis tools. The D2A dataset is built by analyzing version pairs from multiple open-source projects. From each project, we select bug-fixing commits and run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely real bugs that the commit fixed. We use D2A to generate a large labeled dataset for training vulnerability identification models. We show that the dataset can be used to build a classifier that identifies likely false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first.
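The core labeling heuristic described above can be sketched in a few lines: collect the issues reported before and after a bug-fixing commit, and label each before-commit issue by whether it survives the fix. This is a minimal illustration only; the issue key and function names are hypothetical, and the actual D2A pipeline involves more careful issue matching across versions.

```python
def label_issues(before_issues, after_issues):
    """Differential labeling sketch: issues that disappear after a
    bug-fixing commit are likely true positives; issues that persist
    are likely false positives.

    Each issue is a hashable key (here, an illustrative tuple of
    (bug_type, file, procedure)) so reports can be matched across
    the before-commit and after-commit versions.
    """
    after = set(after_issues)
    labels = {}
    for issue in before_issues:
        if issue not in after:
            # Disappeared after the fix: likely a real bug the commit fixed.
            labels[issue] = "likely_true_positive"
        else:
            # Survived the fix: likely a false alarm.
            labels[issue] = "likely_false_positive"
    return labels


# Toy example with made-up static analyzer reports.
before = [("BUFFER_OVERRUN", "src/parse.c", "read_header"),
          ("NULL_DEREFERENCE", "src/util.c", "copy_name")]
after = [("NULL_DEREFERENCE", "src/util.c", "copy_name")]

print(label_issues(before, after))
```

Here the buffer-overrun report vanishes in the after-commit version and is labeled a likely true positive, while the null-dereference report persists and is labeled a likely false positive.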