Intrusion detection is an essential task in the cyber threat environment, and machine learning and deep learning techniques have been applied to it. However, most existing research focuses on modeling and overlooks the fact that poor data quality directly affects the performance of a machine learning system. More attention should therefore be paid to data work when building a machine learning-based intrusion detection system. This article first summarizes existing machine learning-based intrusion detection systems and the datasets used to build them. It then discusses the data preparation workflow and the data quality requirements for intrusion detection. To determine how data and models affect machine learning performance, we conducted experiments on 11 host-based intrusion detection system (HIDS) datasets using seven machine learning models and three deep learning models. The experimental results show that BERT and GPT were the best-performing algorithms for HIDS on all of the datasets. However, performance varies across datasets, indicating differences in their data quality. We then evaluate the data quality of the 11 datasets along the quality dimensions proposed in this article to determine the characteristics that a HIDS dataset should possess in order to yield the best possible results. This research introduces a data quality perspective that researchers and practitioners can use to improve the performance of machine learning-based intrusion detection.