Software vulnerabilities are an ever-growing problem in software development. In most cases, it is desirable to detect vulnerabilities as early as possible, preferably just in time, when the vulnerable code is added to the code base. The industry struggles to combat this problem because manual inspection is costly and traditional means, such as rule-based bug detection, are not robust enough to keep pace with the emergence of new vulnerabilities. Machine learning, an actively researched field, could help in such situations because models can be trained to detect vulnerable patterns. However, machine learning models work well only if the data is appropriately represented. In this work, we propose a novel way of representing changes in source code (i.e., code commits): the Code Change Tree, a form designed to keep only the differences between two abstract syntax trees of Java source code. We compared its effectiveness in predicting whether a code change introduces a vulnerability against multiple representation types, evaluating each with a number of machine learning models as a baseline. The evaluation was performed on a novel dataset that we built with a two-phase dataset generator method and published as part of our contributions. Based on our evaluation, we conclude that the Code Change Tree is a valid and effective choice for representing source code changes, as it improves prediction performance.
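To make the idea of "keeping only the differences between two abstract syntax trees" concrete, the following is a minimal illustrative sketch, not the authors' construction. It assumes a toy `Node` type with string labels, pairs children by position, and marks removed and added subtrees with hypothetical `-`, `+`, and `~` labels; the actual Code Change Tree may match nodes and encode changes quite differently.

```python
from dataclasses import dataclass, field
from itertools import zip_longest
from typing import List, Optional


@dataclass
class Node:
    """Toy AST node (an assumption for illustration, not a real Java AST)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def diff_tree(old: Optional[Node], new: Optional[Node]) -> Optional[Node]:
    """Return a tree holding only the parts where `old` and `new` differ.

    Children are paired by position, which is a deliberate simplification.
    Returns None when the two subtrees are identical.
    """
    if old == new:  # dataclass equality compares labels and children recursively
        return None
    if old is None:  # subtree exists only in the new version
        return Node("+" + new.label, new.children)
    if new is None:  # subtree exists only in the old version
        return Node("-" + old.label, old.children)
    if old.label != new.label:  # node replaced: keep both sides under a marker
        return Node("~", [Node("-" + old.label, old.children),
                          Node("+" + new.label, new.children)])
    # Same label: recurse and keep only the children that actually changed.
    changed = [d for o, n in zip_longest(old.children, new.children)
               if (d := diff_tree(o, n)) is not None]
    return Node(old.label, changed)


# Hypothetical before/after trees for a one-argument change in a method body.
old = Node("MethodDeclaration", [
    Node("Name:run"),
    Node("Block", [Node("Call:log"), Node("Call:exec", [Node("Arg:cmd")])]),
])
new = Node("MethodDeclaration", [
    Node("Name:run"),
    Node("Block", [Node("Call:log"), Node("Call:exec", [Node("Arg:userInput")])]),
])

delta = diff_tree(old, new)
```

In this sketch the unchanged `Name:run` and `Call:log` nodes are dropped, and `delta` retains only the path from the root down to the replaced argument, which is the kind of compact, change-focused input a downstream model could consume.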