Many of the proposed machine learning (ML) based network intrusion detection systems (NIDSs) achieve near-perfect detection performance when evaluated on synthetic benchmark datasets. However, there is no record of whether and how these results generalise to other network scenarios, in particular to real-world networks. In this paper, we investigate the generalisability of ML-based NIDSs by extensively evaluating seven supervised and unsupervised learning models on four recently published benchmark NIDS datasets. Our investigation indicates that none of the considered models is able to generalise over all studied datasets. Interestingly, our results also indicate that generalisability is highly asymmetric, i.e., swapping the source and target domains can significantly change the classification performance. Our investigation further indicates that, overall, unsupervised learning methods generalise better than supervised learning models in the considered scenarios. An explanation of these results using SHAP values indicates that the lack of generalisability is mainly due to a strong correspondence between the values of one or more features and the Attack/Benign classes in one dataset-model combination, and the absence of this correspondence in other datasets that have different feature distributions.
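The source/target asymmetry described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the two single-feature datasets `A` and `B`, the one-threshold classifier, and the helper names `fit_stump`, `predict`, `accuracy` and `cross_eval` are hypothetical and are not the models, datasets or code used in the paper. The sketch trains on one dataset and tests on the other, in both directions.

```python
# Toy cross-dataset evaluation. All data and names are hypothetical.
# Each sample is (feature_value, label), with label 1 = Attack, 0 = Benign.

def predict(model, x):
    """Predict Attack (1) when x lies on the 'attack side' of the threshold."""
    threshold, sign = model
    return 1 if sign * (x - threshold) >= 0 else 0

def accuracy(model, samples):
    """Fraction of samples classified correctly."""
    return sum(predict(model, x) == y for x, y in samples) / len(samples)

def fit_stump(samples):
    """Search observed feature values for the threshold and direction with the
    highest training accuracy; ties keep the first candidate found."""
    best_acc, best_model = -1.0, None
    for threshold in sorted({x for x, _ in samples}):
        for sign in (1, -1):
            acc = accuracy((threshold, sign), samples)
            if acc > best_acc:
                best_acc, best_model = acc, (threshold, sign)
    return best_model

def cross_eval(datasets):
    """Train on every source dataset and test on every target dataset."""
    models = {name: fit_stump(data) for name, data in datasets.items()}
    return {
        (src, tgt): accuracy(models[src], datasets[tgt])
        for src in datasets
        for tgt in datasets
    }

# Dataset A: benign values 0..9, attack values 50..59.
# Dataset B: benign values 20..29, attack values 40..49.
# The feature separates the classes in both, but at different values.
datasets = {
    "A": [(x, 0) for x in range(0, 10)] + [(x, 1) for x in range(50, 60)],
    "B": [(x, 0) for x in range(20, 30)] + [(x, 1) for x in range(40, 50)],
}

results = cross_eval(datasets)
for (src, tgt), acc in sorted(results.items()):
    print(f"train on {src}, test on {tgt}: accuracy = {acc:.2f}")
```

In this toy setting both models are perfect on their own dataset, yet the model trained on `A` labels every `B` sample as benign (accuracy 0.50), while the model trained on `B` classifies `A` perfectly. The threshold learned on one dataset encodes a feature-value/class correspondence that need not hold under another dataset's feature distribution, which is the kind of effect the SHAP-based explanation points to.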