VMCDL: Vulnerability Mining Based on Cascaded Deep Learning Under Source Control Flow

Source: arXiv. Reader's note: the paper's mathematical derivations lack coherence, and both the location of sensitive words and the formation of slices need further elaboration.

With the rapid development of the computer industry and computer software, the risk of software vulnerabilities being exploited has greatly increased. However, existing techniques for mining vulnerabilities in source code still have many shortcomings, such as a high false alarm rate, coarse-grained detection, and dependence on expert experience. In this paper, we mainly use the C/C++ source code of the SARD dataset, process the source code of the CWE476, CWE469, CWE516 and CWE570 vulnerability types, test the vulnerability scanning function of the cutting-edge tool Joern, and propose VMCDL, a new cascaded deep learning model based on source code control flow, to detect vulnerabilities effectively. First, we use Joern to locate and extract sensitive functions and statements, forming a library of sensitive statements from vulnerable code. Then, CFG-based vulnerability code snippets are generated by bidirectional breadth-first traversal and vectorized with Doc2vec. Finally, the cascaded deep learning model classifies the vectorized snippets. In the experimental evaluation, we report Joern's test results on specific vulnerabilities, give the confusion matrix and label data for the model's binary classification of source code of a single vulnerability type, and compare five indicators, FPR, FNR, ACC, P and F1, which reach 10.30%, 5.20%, 92.50%, 85.10% and 85.40%, respectively. This shows that the model can effectively reduce the false alarm rate of static analysis.
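
The abstract does not spell out how the bidirectional breadth-first traversal turns a CFG into a code snippet. The following is a minimal sketch of one plausible reading, not the authors' implementation: starting from a sensitive statement, walk the CFG forward along successor edges and backward along predecessor edges, then emit the visited statements in line order. The edge-list representation, the `max_depth` limit, and all function names are assumptions made for illustration.

```python
from collections import deque


def bfs(start, neighbors, max_depth=None):
    """Breadth-first search; returns every node reached from `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        # Stop expanding once the (hypothetical) depth limit is reached.
        if max_depth is not None and depth >= max_depth:
            continue
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def bidirectional_slice(cfg_edges, statements, sensitive_line, max_depth=None):
    """Collect statements around `sensitive_line` in both CFG directions."""
    succ, pred = {}, {}
    for src, dst in cfg_edges:
        succ.setdefault(src, []).append(dst)
        pred.setdefault(dst, []).append(src)
    forward = bfs(sensitive_line, succ, max_depth)
    backward = bfs(sensitive_line, pred, max_depth)
    # Emit the union in source-line order so the slice reads like code.
    return [statements[n] for n in sorted(forward | backward)]


# Toy CFG keyed by line number (made-up example, not taken from SARD).
statements = {
    1: "char *p = malloc(10);",
    2: "if (flag)",
    3: "p = NULL;",
    4: "log_msg();",
    5: "*p = 'a';",  # sensitive statement: pointer dereference
    6: "free(p);",
}
cfg_edges = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5), (5, 6)]

full_slice = bidirectional_slice(cfg_edges, statements, sensitive_line=5)
near_slice = bidirectional_slice(cfg_edges, statements, 5, max_depth=1)
```

Here `full_slice` covers all six statements, while `near_slice` keeps only the dereference and its immediate CFG neighbors. Each resulting snippet would then be the unit passed to the vectorization step.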

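
The five indicators named in the abstract (FPR, FNR, ACC, P, F1) can all be derived from a binary confusion matrix. The sketch below assumes the standard definitions of these metrics, which the abstract does not restate. The counts in the usage example are invented for illustration and are not the paper's data.

```python
def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute FPR, FNR, ACC, P and F1 from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "FPR": _ratio(fp, fp + tn),  # benign samples flagged as vulnerable
        "FNR": _ratio(fn, fn + tp),  # vulnerable samples that were missed
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "P": precision,
        "F1": _ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=90, fp=10, tn=90, fn=10)
```

With these made-up counts, FPR and FNR are both 0.1, and ACC, P and F1 are all 0.9.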