Automatically locating vulnerable statements in source code is crucial for assuring software security and reducing developers' debugging effort. This matters even more in today's software ecosystem, where vulnerable code can flow easily and unwittingly within and across software repositories such as GitHub. Across millions of lines of code, traditional static and dynamic approaches struggle to scale. Although existing machine-learning-based approaches look promising in such a setting, most of them detect vulnerable code at a coarser granularity, at the method or file level. Developers therefore still need to inspect a significant amount of code to locate the vulnerable statement(s) that need to be fixed. This paper presents VELVET, a novel ensemble learning approach to locating vulnerable statements. Our model combines graph-based and sequence-based neural networks to capture the local and global context of a program graph and to learn code semantics and vulnerable patterns. To study VELVET's effectiveness, we use an off-the-shelf synthetic dataset and a recently published real-world dataset. In the static analysis setting, where vulnerable functions are not detected in advance, VELVET achieves 4.5x better performance than the baseline static analyzers on the real-world data. For the isolated vulnerability localization task, where we assume that a function is known to be vulnerable but the specific vulnerable statement is unknown, we compare VELVET with several neural networks that also attend to the local and global context of code. VELVET achieves 99.6% and 43.6% top-1 accuracy on synthetic and real-world data, respectively, outperforming the baseline deep-learning models by 5.3-29.0%.
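To make the statement-level task and the top-1 metric concrete, the following minimal sketch may help. It is not VELVET's implementation: the abstract does not say how the two networks are combined, so simple averaging of per-statement scores is assumed here, and top-1 accuracy is assumed to mean that the highest-ranked statement of a function is truly vulnerable. All function names are hypothetical.

```python
from typing import List, Sequence, Set


def ensemble_scores(graph_scores: Sequence[float],
                    seq_scores: Sequence[float]) -> List[float]:
    """Combine per-statement scores from two models.

    Averaging is an assumed combination rule chosen for illustration.
    """
    if len(graph_scores) != len(seq_scores):
        raise ValueError("both models must score the same statements")
    return [(g + s) / 2.0 for g, s in zip(graph_scores, seq_scores)]


def top1_statement(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring statement (first on ties)."""
    if not scores:
        raise ValueError("a function must contain at least one statement")
    return max(range(len(scores)), key=lambda i: scores[i])


def top1_accuracy(predictions: Sequence[int],
                  vulnerable: Sequence[Set[int]]) -> float:
    """Fraction of functions whose top-ranked statement is vulnerable."""
    if not predictions or len(predictions) != len(vulnerable):
        raise ValueError("need one non-empty label set per prediction")
    hits = sum(1 for p, v in zip(predictions, vulnerable) if p in v)
    return hits / len(predictions)


if __name__ == "__main__":
    # Hypothetical scores for a three-statement function.
    combined = ensemble_scores([0.1, 0.8, 0.2], [0.3, 0.6, 0.5])
    predicted = top1_statement(combined)
    print(predicted, top1_accuracy([predicted], [{1}]))
```

Under this reading, a developer inspects only the single top-ranked statement per function, which is what distinguishes statement-level localization from method-level or file-level detection.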