Sources of reliable, code-level information about vulnerabilities that affect open-source software (OSS) are scarce, which hinders the broad adoption of advanced tools that provide code-level detection and assessment of vulnerable OSS dependencies. In this paper, we study the extent to which the output of off-the-shelf static code analyzers can be used as a source of features to represent commits in machine learning (ML) applications. In particular, we investigate how such features can be used to construct embeddings and train ML models that automatically identify source code commits containing vulnerability fixes. We analyze these embeddings for security-relevant and non-security-relevant commits and show that, although the two groups do not differ in a statistically significant manner when the embeddings are considered in isolation, the embeddings can still be used to construct an ML pipeline that achieves results comparable with the state of the art. We also find that combining our method with commit2vec yields a tangible improvement over the state of the art in the automatic identification of commits that fix vulnerabilities: the ML models we construct and commit2vec are complementary, the former being more generally applicable, albeit not as accurate.