Benchmark of Data Preprocessing Methods for Imbalanced Classification

Severe class imbalance is one of the main conditions that make machine learning in cybersecurity difficult. A variety of dataset preprocessing methods have been introduced over the years. These methods modify the training dataset by oversampling, undersampling, or a combination of both to improve the predictive performance of classifiers trained on it. Although these methods are occasionally used in cybersecurity, a comprehensive, unbiased benchmark comparing their performance across a variety of cybersecurity problems is missing. This paper presents a benchmark of 16 preprocessing methods on six cybersecurity datasets together with 17 public imbalanced datasets from other domains. We test the methods under multiple hyperparameter configurations and use an AutoML system to train classifiers on the preprocessed datasets, which reduces potential bias from specific hyperparameter or classifier choices. Special consideration is also given to evaluating the methods using appropriate performance measures that are good proxies for practical performance in real-world cybersecurity systems. The main findings of our study are: 1) Most of the time, there exists a data preprocessing method that improves classification performance. 2) The baseline approach of doing nothing outperformed a large portion of the methods in the benchmark. 3) Oversampling methods generally outperform undersampling methods. 4) The most significant performance gains come from the standard SMOTE algorithm; more complicated methods provide mainly incremental improvements, often at the cost of worse computational performance.
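To make the two families of preprocessing concrete, the following is a minimal sketch of SMOTE-style oversampling and random undersampling in plain Python. It is an illustration only, not the implementation used in the benchmark: the function names `smote_sketch` and `random_undersample`, the toy data, and the default `k=3` are all assumptions introduced here, and the sketch simplifies standard SMOTE by drawing base points at random rather than iterating over every minority sample.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between
    a randomly chosen minority point and one of its k nearest minority
    neighbours (a simplified, hypothetical variant of SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority point, nearest first.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of n_keep majority points (no replacement)."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


# Toy, made-up data: 4 minority points versus 20 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i % 5), float(i // 5)) for i in range(20)]

# Oversampling: grow the minority class to the size of the majority class.
new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points

# Undersampling: shrink the majority class to the size of the minority class.
reduced_majority = random_undersample(majority, n_keep=len(minority))
```

Either result can then be passed to a classifier in place of the original training set. The oversampled variant keeps all majority data and adds interpolated minority points, while the undersampled variant discards majority data, which is one plausible reason the two families can behave differently in practice.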
