Identifying vulnerabilities in source code is essential to protecting software systems from cybersecurity attacks. However, it is also a challenging step that requires specialized expertise in security and code representation. To this end, we aim to develop a general, practical, and programming language-independent model that can run on a wide range of source code and libraries without difficulty. We therefore treat vulnerability detection as an inductive text classification problem and propose ReGVD, a simple yet effective graph neural network (GNN)-based model for this problem. In particular, ReGVD views each raw source code sample as a flat sequence of tokens from which it builds a graph, wherein node features are initialized using only the token embedding layer of a pre-trained programming language (PL) model. ReGVD then leverages residual connections among GNN layers and examines a mixture of graph-level sum and max pooling to return a graph embedding for the source code. Experimental results demonstrate that ReGVD outperforms the existing state-of-the-art models and obtains the highest accuracy on the real-world benchmark dataset from CodeXGLUE for vulnerability detection. Our code is available at: \url{https://github.com/daiquocnguyen/GNN-ReGVD}.
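The pipeline described above (token sequence to graph, GNN layers with residual connections, then a mixed sum and max readout) can be illustrated with a minimal pure-Python sketch. It is not the released ReGVD implementation. Several details are assumptions made only for illustration: the sliding-window co-occurrence rule for edges, the mean-aggregation layer, the element-wise addition used to mix the two poolings, and all function names. Random vectors stand in for the token embeddings of a pre-trained PL model.

```python
import random


def build_token_graph(tokens, window=2):
    """Nodes are unique tokens; an edge links two tokens that occur within
    `window` positions of each other (assumed construction rule)."""
    vocab = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(vocab)
    adj = [[0.0] * n for _ in range(n)]
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            a, b = index[tok], index[tokens[j]]
            if a != b:
                adj[a][b] = adj[b][a] = 1.0
    for i in range(n):
        adj[i][i] = 1.0  # self-loop so each node keeps its own features
    return vocab, adj


def gnn_layer_residual(adj, feats, weight):
    """One layer with a residual connection:
    H' = H + ReLU(mean_neighbours(H) @ W)."""
    n, d = len(feats), len(feats[0])
    out = []
    for i in range(n):
        deg = sum(adj[i])
        agg = [sum(adj[i][j] * feats[j][k] for j in range(n)) / deg
               for k in range(d)]
        msg = [max(0.0, sum(agg[k] * weight[k][c] for k in range(d)))
               for c in range(d)]
        out.append([feats[i][c] + msg[c] for c in range(d)])
    return out


def readout_sum_max(feats):
    """Graph embedding mixing sum and max pooling (here by addition,
    one assumed way to combine them)."""
    d = len(feats[0])
    sum_pool = [sum(row[c] for row in feats) for c in range(d)]
    max_pool = [max(row[c] for row in feats) for c in range(d)]
    return [s + m for s, m in zip(sum_pool, max_pool)]


# Toy usage on a short C-like snippet, already split into tokens.
tokens = "int main ( ) { char buf [ 8 ] ; gets ( buf ) ; }".split()
vocab, adj = build_token_graph(tokens, window=2)

rng = random.Random(0)
dim = 4
# Placeholder for pre-trained token embeddings (random here).
feats = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
# One untrained weight matrix per layer.
weights = [[[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
           for _ in range(2)]

for w in weights:
    feats = gnn_layer_residual(adj, feats, w)

graph_embedding = readout_sum_max(feats)
print(len(vocab), len(graph_embedding))
```

In a full model, the graph embedding would be passed to a classifier that predicts whether the function is vulnerable, and the weights would be trained rather than random.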