An Unbiased Transformer Source Code Learning with Semantic Vulnerability Graph

Over the years, open-source software systems have become prey to threat actors. Even though open-source communities act quickly to patch breaches, code vulnerability screening should be an integral part of agile software development from the beginning. Unfortunately, current vulnerability screening techniques are ineffective at identifying novel vulnerabilities or at providing developers with code vulnerability classification. Furthermore, the datasets used for vulnerability learning often exhibit distribution shifts from the real-world testing distribution because adversaries deploy novel attack strategies; as a result, the machine learning model's performance may be hindered or biased. To address these issues, we propose a joint interpolated multitasked unbiased vulnerability classifier comprising a transformer, RoBERTa, and a graph convolutional neural network (GCN). We present a training process that uses a semantic vulnerability graph (SVG) representation of source code, created by integrating edges from sequential flow, control flow, and data flow, as well as a novel flow dubbed Poacher Flow (PF). Poacher Flow edges reduce the gap between dynamic and static program analysis and handle complex long-range dependencies. Moreover, our approach reduces classifier bias on unbalanced datasets by integrating the Focal Loss objective function with the SVG. Experimental results show that our classifier outperforms state-of-the-art results on vulnerability detection, with fewer false negatives and false positives. Tested across multiple datasets, our model shows an improvement of at least 2.41% and, in the best case, 18.75%. Evaluations using N-day program samples demonstrate that our approach achieves 93% accuracy and detected four zero-day vulnerabilities in popular GitHub repositories.
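
The abstract names Focal Loss as the objective used to reduce bias on unbalanced datasets, but does not give its form or hyperparameters. As a rough illustration only, the sketch below implements the standard binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), for a single sample. The function name `focal_loss` and the values gamma = 2.0 and alpha = 0.25 are assumptions chosen for illustration (they are commonly used defaults), not settings taken from this paper.

```python
import math


def focal_loss(prob_vulnerable, label, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for one sample (illustrative sketch).

    prob_vulnerable: predicted probability that the code is vulnerable.
    label: 1 for vulnerable, 0 for non-vulnerable.
    gamma, alpha: assumed hyperparameters, not taken from the paper.
    """
    eps = 1e-12
    # Clamp the probability so log() never receives exactly 0.
    p = min(max(prob_vulnerable, eps), 1.0 - eps)
    # p_t is the probability the model assigned to the true class.
    p_t = p if label == 1 else 1.0 - p
    # alpha_t weights the positive class by alpha, the negative by 1 - alpha.
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of samples that are already
    # classified confidently, so hard samples dominate the total.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy = focal_loss(0.9, 1)  # confident and correct
    hard = focal_loss(0.1, 1)  # confident and wrong
    print(f"easy sample loss: {easy:.6f}")
    print(f"hard sample loss: {hard:.6f}")
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy. Larger gamma values put more of the training signal on misclassified samples, which is the usual motivation for using this loss when vulnerable samples are rare.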
