Industrial Information Technology (IT) infrastructures are often vulnerable to cyberattacks. Securing computer systems in an industrial environment requires effective intrusion detection systems that monitor cyber-physical systems (e.g., computer networks) for malicious activity. This paper builds such an intrusion detection system to protect computer networks from cyberattacks. Specifically, we propose a novel unsupervised machine learning approach that combines the K-Means algorithm with the Isolation Forest for anomaly detection in industrial big data scenarios. Because the system targets big data scenarios in the industrial domain, we implement the proposed model with the Apache Spark framework and train it on large-scale network traffic data (about 123 million instances) stored in Elasticsearch. We also evaluate the model on live streaming data and find that the system can be used for real-time anomaly detection in an industrial setup. In addition, we describe the challenges we faced while training the model on large datasets and explain how we resolved them. Our empirical evaluation across different anomaly detection use cases on real-world network traffic data shows that the proposed system detects anomalies effectively in big data scenarios. Finally, we evaluate the model on several academic datasets and find that its performance is comparable to that of other state-of-the-art approaches.
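The abstract does not state how K-Means and the Isolation Forest are combined, so the following is only a minimal sketch of one plausible arrangement: cluster the traffic with K-Means, fit a separate isolation forest per cluster, and score each new point with the forest of its nearest centroid. It uses only the Python standard library on toy 2-D data rather than Apache Spark, and every function name and parameter (`fit_hybrid`, `hybrid_score`, `sample_size`, and so on) is hypothetical, not taken from the paper.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def avg_path(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def nearest(point, centers):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations; initial centroids are sampled from the data."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers


def build_tree(rows, rng, depth, max_depth):
    """Isolation tree: a leaf is an int (its size), a node is a 4-tuple."""
    if depth >= max_depth or len(rows) <= 1:
        return len(rows)
    dim = rng.randrange(len(rows[0]))
    lo, hi = min(r[dim] for r in rows), max(r[dim] for r in rows)
    if lo == hi:
        return len(rows)
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return (dim, split,
            build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))


def path_length(point, node):
    """Depth at which point lands, plus the leaf-size adjustment."""
    depth = 0
    while not isinstance(node, int):
        dim, split, left, right = node
        node = left if point[dim] < split else right
        depth += 1
    return depth + avg_path(node)


def fit_forest(rows, n_trees=50, sample_size=256, seed=0):
    rng = random.Random(seed)
    psi = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(max(psi, 2)))
    trees = [build_tree(rng.sample(rows, psi), rng, 0, max_depth)
             for _ in range(n_trees)]
    return trees, psi


def forest_score(point, forest):
    """Anomaly score in (0, 1]; values near 1 indicate easy isolation."""
    trees, psi = forest
    mean_depth = sum(path_length(point, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / (avg_path(psi) or 1.0))


def fit_hybrid(points, k, seed=0):
    """Assumed combination: one isolation forest per non-empty K-Means cluster."""
    centers = kmeans(points, k, seed=seed)
    groups = {}
    for p in points:
        groups.setdefault(nearest(p, centers), []).append(p)
    kept = sorted(groups)
    return ([centers[i] for i in kept],
            [fit_forest(groups[i], seed=seed + i) for i in kept])


def hybrid_score(point, model):
    centers, forests = model
    return forest_score(point, forests[nearest(point, centers)])


# Toy data: two tight synthetic clusters stand in for normal traffic.
rng = random.Random(42)
train = ([(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)] +
         [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(200)])
model = fit_hybrid(train, k=2)
inlier_score = hybrid_score((0.0, 0.0), model)
outlier_score = hybrid_score((5.0, -5.0), model)
```

In this sketch a point far from both clusters should be isolated in fewer splits than a point at a cluster center, so `outlier_score` is expected to exceed `inlier_score`. Fitting per-cluster forests is only one design option; an alternative would be to feed cluster distances into a single forest as extra features.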