Identifying vulnerabilities in source code is essential to protecting software systems from cybersecurity attacks. It is, however, also a challenging step that requires specialized expertise in security and code representation. Inspired by the successful applications of pre-trained programming language (PL) models such as CodeBERT and of graph neural networks (GNNs), we propose ReGVD, a general and novel graph neural network-based model for vulnerability detection. In particular, ReGVD views a given piece of source code as a flat sequence of tokens and then examines two effective methods, based on unique tokens and on token indexes respectively, for constructing a single graph as input, wherein node features are initialized only by the embedding layer of a pre-trained PL model. Next, ReGVD leverages residual connections among GNN layers and explores a beneficial mixture of graph-level sum and max pooling to return a graph embedding for the given source code. Experimental results demonstrate that ReGVD outperforms the existing state-of-the-art models and obtains the highest accuracy on the real-world benchmark dataset from CodeXGLUE for vulnerability detection.
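The pipeline described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: edges connect nodes that co-occur inside a sliding window (the abstract does not state the edge rule), node features are toy vectors rather than embeddings from a pre-trained PL model, the message-passing layer has no learned weights, and sum and max pooling are mixed by an elementwise product. The names `build_graph`, `residual_layer`, and `readout` are hypothetical.

```python
from itertools import combinations


def build_graph(tokens, window=3, unique=True):
    """Return (nodes, edges) for a flat token sequence.

    unique=True  -> one node per distinct token (unique-token variant)
    unique=False -> one node per token position (index variant)
    Assumption: an edge links two nodes that co-occur in a sliding window.
    """
    if unique:
        nodes = list(dict.fromkeys(tokens))   # distinct tokens, first-seen order
        node_id = {tok: i for i, tok in enumerate(nodes)}
        ids = [node_id[tok] for tok in tokens]
    else:
        nodes = list(tokens)                  # every position is its own node
        ids = list(range(len(tokens)))
    edges = set()
    for start in range(max(1, len(ids) - window + 1)):
        for a, b in combinations(ids[start:start + window], 2):
            if a != b:                        # skip self-pairs of a repeated token
                edges.add((min(a, b), max(a, b)))
    return nodes, edges


def residual_layer(h, edges):
    """One mean-aggregation step with a residual connection: h + ReLU(mean)."""
    n, d = len(h), len(h[0])
    neigh = [[i] for i in range(n)]           # include a self-loop per node
    for a, b in edges:
        neigh[a].append(b)
        neigh[b].append(a)
    out = []
    for i in range(n):
        agg = [sum(h[j][k] for j in neigh[i]) / len(neigh[i]) for k in range(d)]
        out.append([h[i][k] + max(0.0, agg[k]) for k in range(d)])
    return out


def readout(h):
    """Graph embedding mixing sum and max pooling (elementwise product assumed)."""
    d = len(h[0])
    sum_pool = [sum(row[k] for row in h) for k in range(d)]
    max_pool = [max(row[k] for row in h) for k in range(d)]
    return [s * m for s, m in zip(sum_pool, max_pool)]


if __name__ == "__main__":
    tokens = ["int", "x", "=", "x", "+", "1", ";"]
    nodes, edges = build_graph(tokens, window=3, unique=True)
    features = [[float(i), 1.0] for i in range(len(nodes))]  # toy node features
    print(nodes, sorted(edges), readout(residual_layer(features, edges)))
```

In this sketch the unique-token variant merges the two occurrences of `x` into a single node, while the index variant keeps one node per position, so the same snippet yields a smaller, denser graph in the first case and a larger, chain-like graph in the second.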