Supervised machine learning approaches require highly customized and fine-tuned methodologies to deliver outstanding performance. This paper presents a dataset-driven design and performance evaluation of a machine learning classifier for the network intrusion dataset UNSW-NB15. Analysis of the dataset suggests that it suffers from class imbalance and from class overlap in the feature space. We employ an ensemble of three base classifiers: Balanced Bagging (BB), eXtreme Gradient Boosting (XGBoost), and Random Forest built on Hellinger Distance Decision Trees (RF-HDDT). BB and XGBoost are tuned to handle the imbalanced data, and the Random Forest (RF) classifier is supplemented by the Hellinger distance metric to address the imbalance issue. Two new algorithms are proposed to address the class overlap issue in the dataset. These two algorithms help improve performance on the testing dataset by modifying the final classification decisions made by the three base classifiers as part of an ensemble classifier that employs a majority vote combiner. The proposed design is evaluated for both binary and multi-category classification. A comparison with models reported in the literature on the same dataset demonstrates that the proposed model outperforms them by a significant margin in both the binary and multi-category classification cases.
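
As a rough illustration of two ingredients named above, the following minimal sketch computes the Hellinger distance of a candidate split in its commonly used HDDT form and combines three base-classifier labels by majority vote. The function names, the example counts, and the tie-breaking rule are assumptions made for illustration; they are not taken from the paper, and the sketch does not attempt to reproduce the two proposed class-overlap algorithms, which the abstract does not specify.

```python
from collections import Counter
from math import sqrt


def hellinger_split_value(pos_per_branch, neg_per_branch):
    """Hellinger distance between the two class distributions over a split's branches.

    pos_per_branch[j] / neg_per_branch[j] are the counts of minority / majority
    samples that fall into branch j. The value depends only on within-class
    proportions, so it is insensitive to the class prior.
    """
    total_pos = sum(pos_per_branch)
    total_neg = sum(neg_per_branch)
    if total_pos == 0 or total_neg == 0:
        # Assumption: a node containing a single class has nothing to separate.
        return 0.0
    return sqrt(sum(
        (sqrt(p / total_pos) - sqrt(n / total_neg)) ** 2
        for p, n in zip(pos_per_branch, neg_per_branch)
    ))


def majority_vote(predictions):
    """Return the label predicted by most base classifiers.

    Assumption: on a tie (possible in the multi-category case), the label of
    the earliest classifier among the tied ones is kept.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical counts: a split that isolates all 10 attacks from 90 normal flows.
perfect = hellinger_split_value([10, 0], [0, 90])
# Hypothetical counts: a split that sends half of each class to each branch.
useless = hellinger_split_value([5, 5], [45, 45])
# Hypothetical labels from the three base classifiers (BB, XGBoost, RF-HDDT).
decision = majority_vote(["attack", "normal", "attack"])
```

A perfectly separating binary split reaches the maximum value of sqrt(2), while a split that leaves both classes equally distributed scores 0, regardless of how rare the minority class is.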