This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity and its limited cost in terms of training data size and number of model parameters.
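To make the overall pipeline concrete, the following is a minimal sketch (not the authors' implementation) of fine-tuning a RoBERTa-style encoder as a binary vulnerability classifier using HuggingFace Transformers. The checkpoint name, the toy code samples, and the `CodeSnippetDataset` helper are placeholders; VulBERTa's custom tokenisation pipeline and its weights pre-trained on open-source C/C++ code are not reproduced here.

```python
# Minimal sketch (assumption: HuggingFace Transformers + PyTorch), not the
# authors' implementation. Fine-tunes a RoBERTa-style encoder as a binary
# vulnerability classifier on toy code snippets.
import torch
from torch.utils.data import Dataset
from transformers import (RobertaTokenizerFast, RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

class CodeSnippetDataset(Dataset):
    """Wraps (source-code string, label) pairs as tokenised model inputs."""
    def __init__(self, snippets, labels, tokenizer, max_len=256):
        self.enc = tokenizer(snippets, truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.enc.items()}
        item["labels"] = self.labels[idx]
        return item

# Placeholder checkpoint; VulBERTa instead uses a custom tokeniser and a model
# pre-trained on real-world C/C++ code.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                         num_labels=2)

train_ds = CodeSnippetDataset(
    ["void f(char *s){ char b[8]; strcpy(b, s); }",                 # toy vulnerable sample
     "void g(char *s){ char b[8]; strncpy(b, s, 7); b[7] = 0; }"],  # toy safe sample
    [1, 0], tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vulberta-ft", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_ds,
)
trainer.train()
```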