Background: Machine learning techniques are widely used and show promising performance in many software security tasks, such as software vulnerability prediction. However, the class ratio within software vulnerability datasets is often highly imbalanced, since the percentage of observed vulnerabilities is usually very low. Goal: To help security practitioners address class imbalance in software security data and build better prediction models with resampled datasets. Method: We introduce Dazzle, an optimized version of conditional Wasserstein Generative Adversarial Networks with gradient penalty (cWGAN-GP). Dazzle explores the architecture hyperparameters of cWGAN-GP with a novel optimizer called Bayesian Optimization. We use Dazzle to generate minority class samples to resample the original imbalanced training dataset. Results: We evaluate Dazzle on three software security datasets: Moodle vulnerable files, Ambari bug reports, and JavaScript function code. We show that Dazzle is practical to use and shows promising improvement over existing state-of-the-art oversampling techniques such as SMOTE (e.g., an average improvement of about 60% over SMOTE in recall across all datasets). Conclusion: Based on this study, we suggest the use of optimized GANs as an alternative method for addressing class imbalance in security vulnerability data.
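
The resampling step described under Method can be illustrated with a minimal sketch. It is not the Dazzle implementation: the function names `resample_with_generator` and `make_jitter_generator` are hypothetical, the data is synthetic, and a simple Gaussian-jitter sampler stands in for a trained cWGAN-GP generator. The sketch only shows how generated minority class samples would be appended to an imbalanced training set until the two classes are balanced.

```python
import numpy as np


def make_jitter_generator(X_minority, scale=0.05, seed=0):
    """Stand-in for a trained generator (assumption, not cWGAN-GP):
    returns a function that samples minority rows and adds Gaussian noise."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    def generate(n):
        # Pick n existing minority rows at random and perturb them slightly.
        idx = rng.integers(0, len(X_minority), size=n)
        noise = rng.normal(0.0, scale, size=(n, X_minority.shape[1]))
        return X_minority[idx] + noise

    return generate


def resample_with_generator(X, y, generate_minority, minority_label=1):
    """Append synthetic minority samples until both classes have equal counts."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_min = int(np.sum(y == minority_label))
    n_maj = int(np.sum(y != minority_label))
    n_needed = max(n_maj - n_min, 0)
    if n_needed == 0:
        return X, y
    X_new = generate_minority(n_needed)
    y_new = np.full(n_needed, minority_label, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


# Synthetic, highly imbalanced training set: 5 minority vs. 95 majority rows.
data_rng = np.random.default_rng(42)
X_train = data_rng.normal(size=(100, 4))
y_train = np.array([1] * 5 + [0] * 95)

generator = make_jitter_generator(X_train[y_train == 1])
X_res, y_res = resample_with_generator(X_train, y_train, generator)
print(X_res.shape, int((y_res == 1).sum()), int((y_res == 0).sum()))
```

In an actual pipeline the `generate` callable would be replaced by sampling from the trained conditional generator, and only the training split would be resampled so that the test set keeps its original class ratio.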