Background: Most existing machine learning models for security tasks, such as spam detection, malware detection, or network intrusion detection, are built on supervised machine learning algorithms. In this paradigm, models need a large amount of labeled data to learn useful relationships between the selected features and the target class. However, such labeled data can be scarce and expensive to acquire. Goal: To help security practitioners train useful security classification models when only a small amount of labeled training data and a large amount of unlabeled training data are available. Method: We propose an adaptive framework called Dapper, which optimizes 1) a semi-supervised learning algorithm that assigns pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine learning classifier (i.e., random forest). When the dataset is highly class-imbalanced, Dapper additionally integrates and optimizes a data oversampling method called SMOTE. We use Bayesian Optimization to search the large hyperparameter space of these tuning targets. Result: We evaluate Dapper on three security datasets, i.e., the Twitter spam dataset, the malware URLs dataset, and the CIC-IDS-2017 dataset. Experimental results indicate that with as little as 10% of the original labeled data, we can achieve classification performance close to, or even better than, using 100% of the labeled data in a supervised way. Conclusion: Based on these results, we recommend combining hyperparameter optimization with semi-supervised learning when dealing with shortages of labeled security data.
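To make the described workflow concrete, the following is a minimal sketch of a Dapper-style pipeline, assuming scikit-learn's LabelSpreading as the semi-supervised pseudo-labeler, imbalanced-learn's SMOTE for oversampling, and scikit-optimize's BayesSearchCV as the Bayesian Optimization engine. The synthetic dataset, the 10% labeling ratio, and the hyperparameter ranges are illustrative placeholders, not the authors' actual implementation or settings.

```python
# Sketch: pseudo-labeling + SMOTE + Bayesian Optimization of a random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading
from imblearn.over_sampling import SMOTE
from skopt import BayesSearchCV
from skopt.space import Integer, Real

# Synthetic, class-imbalanced stand-in for a security dataset.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Keep only ~10% of the training labels; mark the rest as unlabeled (-1).
rng = np.random.RandomState(42)
y_semi = y_train.copy()
y_semi[rng.rand(len(y_train)) > 0.10] = -1

# Step 1: propagate pseudo-labels from the few labeled points to the rest.
propagator = LabelSpreading(kernel="knn", n_neighbors=7)
propagator.fit(X_train, y_semi)
y_pseudo = propagator.transduction_

# Step 2: rebalance the pseudo-labeled training set with SMOTE.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_pseudo)

# Step 3: Bayesian Optimization over random forest hyperparameters
#         (ranges below are arbitrary examples).
search = BayesSearchCV(
    RandomForestClassifier(random_state=42),
    {
        "n_estimators": Integer(50, 300),
        "max_depth": Integer(3, 20),
        "max_features": Real(0.1, 1.0),
    },
    n_iter=20, cv=3, scoring="f1", random_state=42)
search.fit(X_bal, y_bal)

print("best hyperparameters:", search.best_params_)
print("held-out F1:", search.score(X_test, y_test))
```

In this sketch the semi-supervised step, the oversampler, and the classifier are all tunable components, which mirrors the idea of treating pseudo-labeling, SMOTE, and the random forest as joint targets of one hyperparameter search.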