Intrusion detection systems (IDS) monitor networks or systems for attack activity or policy violations. Such a system should reliably identify anomalous deviations from normal traffic behavior. Here we discuss a machine learning approach to building an anomaly-based IDS using the CSE-CIC-IDS2018 dataset. Since the publication of this dataset, a relatively large number of papers have appeared, most of them presenting IDS architectures and results based on complex machine learning methods, such as deep neural networks, gradient boosting classifiers, or hidden Markov models. Here we show that similar results can be obtained using a very simple nearest neighbor classification approach, avoiding the inherent complications of training such complex models. The advantages of the nearest neighbor algorithm are: (1) it is very simple to implement; (2) it is extremely robust; (3) it has no parameters, and therefore it cannot overfit the data. This result also points to a current trend in the machine learning community of developing over-engineered solutions: complex methods, such as deep neural networks, are adopted without even considering baseline solutions based on simple but efficient methods.
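To make the nearest neighbor idea concrete, the following is a minimal sketch of 1-nearest-neighbor classification: each query sample receives the label of its closest training sample. The abstract does not state the distance metric, feature set, or preprocessing, so the Euclidean distance, the function name `nearest_neighbor_predict`, and the toy feature values below are assumptions made purely for illustration; they are not taken from CSE-CIC-IDS2018 or from the authors' implementation.

```python
import numpy as np


def nearest_neighbor_predict(train_x, train_y, query_x):
    """Label each query row with the label of its closest training row (1-NN).

    Assumes squared Euclidean distance; the metric is an illustrative choice.
    """
    # Pairwise differences via broadcasting, shape (n_query, n_train, n_features).
    diffs = query_x[:, None, :] - train_x[None, :, :]
    # Squared distances, shape (n_query, n_train).
    dists = np.sum(diffs * diffs, axis=2)
    # Index of the closest training sample for every query row.
    return train_y[np.argmin(dists, axis=1)]


# Hypothetical toy flow features: [duration in seconds, packets per second].
train_x = np.array([[0.2, 10.0], [0.3, 12.0], [5.0, 900.0], [4.5, 850.0]])
train_y = np.array([0, 0, 1, 1])  # 0 = benign, 1 = attack (made-up labels)

query_x = np.array([[0.25, 11.0], [4.8, 880.0]])
predictions = nearest_neighbor_predict(train_x, train_y, query_x)
print(predictions)  # [0 1]
```

There is no training step here: the labeled samples are simply stored and compared against at prediction time. In practice, features with very different scales would likely need normalizing before distances are computed, which this sketch omits.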