Deep learning (DL) techniques are on the rise in the software engineering research community. More and more approaches are being developed on top of DL models, fueled by the unprecedented amount of software-related data available to train them. One recent application of DL in the software engineering domain is the automatic detection of software vulnerabilities. While several DL models have been developed to tackle this problem, there is still limited empirical evidence about their actual effectiveness, especially when compared with shallow machine learning techniques. In this paper, we partially fill this gap by presenting a large-scale empirical study using three vulnerability datasets and five different source code representations (i.e., the format in which the code is provided to the classifiers to assess whether it is vulnerable). We compare the effectiveness of two widely used DL-based models and one shallow machine learning model in (i) classifying code functions as vulnerable or non-vulnerable (i.e., binary classification), and (ii) classifying code functions by the specific type of vulnerability they contain (or "clean", if they contain none). As an additional baseline, we include in our study the AutoML utility provided by the Google Cloud Platform. Our results show that the studied models are still far from ensuring reliable vulnerability detection, and that a shallow learning classifier represents a competitive baseline for the newest DL-based models.