Learning from what we know: How to perform vulnerability prediction using noisy historical data

From arXiv. Please do not cite this new version of the article. The article (with its previous versions) is already available as arXiv:2012.11701.

Vulnerability prediction refers to the problem of identifying the system components that are most likely to be vulnerable. Typically, this problem is tackled by training binary classifiers on historical data. Unfortunately, recent research has shown that such approaches underperform for two reasons: a) the imbalanced nature of the problem, and b) the inherently noisy historical data, i.e., most vulnerabilities are discovered much later than they are introduced. This misleads classifiers, as they learn to recognize actually vulnerable components as non-vulnerable. To tackle these issues, we propose TROVON, a technique that learns from known vulnerable components rather than from vulnerable and non-vulnerable components, as is typically done. We do this by contrasting the known vulnerable components with their respective fixed versions. This way, TROVON learns from the things we know, i.e., vulnerabilities, hence reducing the effects of noisy and imbalanced data. We evaluate TROVON by comparing it with existing techniques on three security-critical open source systems, i.e., Linux Kernel, OpenSSL, and Wireshark, with historical vulnerabilities that have been reported in the National Vulnerability Database (NVD). Our evaluation demonstrates that TROVON significantly outperforms existing vulnerability prediction techniques such as Software Metrics, Imports, Function Calls, Text Mining, Devign, LSTM, and LSTM-RF, with an improvement of 40.84% in Matthews Correlation Coefficient (MCC) score under Clean Training Data Settings, and an improvement of 35.52% under Realistic Training Data Settings.
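The abstract reports results in MCC, and a small worked example may help show why that metric suits an imbalanced problem like this one. The sketch below is only an illustration: the confusion-matrix counts are hypothetical (1000 components, 20 of them vulnerable) and do not come from the paper, and the helper names `mcc` and `accuracy` are our own.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, fp, tn, fn):
    """Fraction of components classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts, not taken from the paper: 1000 components, 20 vulnerable.
# Classifier A labels every component as non-vulnerable.
always_negative = dict(tp=0, fp=0, tn=980, fn=20)
# Classifier B finds 10 of the 20 vulnerable components, with 10 false alarms.
modest = dict(tp=10, fp=10, tn=970, fn=10)

for name, counts in [("always_negative", always_negative), ("modest", modest)]:
    print(f"{name}: accuracy={accuracy(**counts):.3f}  MCC={mcc(**counts):.3f}")
```

Under these assumed counts both classifiers reach the same accuracy (0.98), yet the one that never flags a vulnerability scores an MCC of 0, while the one that finds half of them scores roughly 0.49. Accuracy hides the difference because non-vulnerable components dominate the data; MCC does not.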
