Extending Isolation Forest for Anomaly Detection in Big Data via K-Means

Industrial information technology (IT) infrastructures are often vulnerable to cyberattacks. Securing the computer systems in an industrial environment requires effective intrusion detection systems that monitor cyber-physical systems (e.g., computer networks) for malicious activity. This paper aims to build such an intrusion detection system to protect computer networks from cyberattacks. More specifically, we propose a novel unsupervised machine learning approach that combines the K-Means algorithm with the Isolation Forest for anomaly detection in industrial big data scenarios. Because our target is the big data scenario in the industrial domain, we implement the proposed model on the Apache Spark framework and train it on a large network traffic dataset (about 123 million instances of network traffic) stored in Elasticsearch. We also evaluate the model on live streaming data and find that the system can be used for real-time anomaly detection in an industrial setup. In addition, we describe the challenges we faced while training the model on large datasets and explain how each was resolved. Our empirical evaluation across several anomaly detection use cases on real-world network traffic data indicates that the system detects anomalies effectively in big data scenarios. Finally, we evaluate the model on several academic datasets and find that its performance is comparable to that of other state-of-the-art approaches.
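The abstract does not say how K-Means and the Isolation Forest are combined, so the following is only a minimal sketch of one plausible reading, not the paper's implementation. It assumes the Isolation Forest produces an anomaly score per instance and that a one-dimensional, two-cluster K-Means over those scores replaces a hand-tuned threshold. It is written in plain Python rather than Apache Spark; all function names, parameters, and the toy data are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Grow one isolation tree by splitting a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop here instead of retrying another feature
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, x, depth=0):
    """Depth at which x lands in a leaf, plus a correction for unsplit leaf points."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    return path_length(left if x[dim] < split else right, x, depth + 1)

def anomaly_scores(data, n_trees=100, sample_size=64, seed=0):
    """Isolation Forest score in (0, 1); values near 1 mean 'easy to isolate'."""
    if len(data) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = max(2, min(sample_size, len(data)))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2.0 ** (-(sum(path_length(t, x) for t in trees) / n_trees) / norm)
            for x in data]

def two_means_threshold(scores, iters=50):
    """1-D K-Means with k=2 on the scores; returns the midpoint between centroids."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        cut = (lo + hi) / 2.0
        low = [s for s in scores if s <= cut]
        high = [s for s in scores if s > cut]
        if not high:  # all scores identical: nothing to separate
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0

# Toy data (hypothetical): 100 points in the unit square plus 3 distant outliers.
rng = random.Random(42)
normal_points = [(rng.random(), rng.random()) for _ in range(100)]
outliers = [(50.0, 50.0), (-40.0, 30.0), (20.0, -60.0)]
data = normal_points + outliers

# For this tiny demo every tree sees all points; on large data one would subsample.
scores = anomaly_scores(data, sample_size=len(data))
threshold = two_means_threshold(scores)
flagged = [i for i, s in enumerate(scores) if s > threshold]
print("threshold:", round(threshold, 3), "flagged indices:", flagged)
```

The point of the design is that the cut-off between "normal" and "anomalous" scores is learned from the score distribution itself, so no contamination rate has to be guessed in advance. Whether the paper uses K-Means this way or in another role is not stated in the abstract.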

Related content

Anomaly detection

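In data mining, anomaly detection is the identification of items, events, or observations that do not conform to an expected pattern or to the other items in a dataset. Anomalous items typically correspond to problems such as bank fraud, structural defects, medical conditions, or errors in text. Anomalies are also called outliers, novelties, noise, deviations, and exceptions. In misuse and network intrusion detection in particular, the objects of interest are often not rare objects but unexpected bursts of activity. Such patterns do not fit the usual statistical definition of an outlier as a rare object, so many anomaly detection methods (unsupervised ones in particular) fail on this kind of data unless it has been aggregated appropriately. Cluster analysis algorithms, by contrast, may be able to detect the micro-clusters that these patterns form. There are three broad categories of anomaly detection methods. [1] Unsupervised methods assume that most instances in the dataset are normal and detect anomalies in unlabeled test data by looking for the instances that fit the rest of the data least well. Supervised methods require a dataset labeled "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherent class imbalance of anomaly detection). Semi-supervised methods build a model of normal behavior from a given training set of normal data and then test how likely a test instance is to have been generated by the learned model.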
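As a small illustration of the semi-supervised category described above, the sketch below fits a single Gaussian to training values that are assumed to be normal and then flags test values that are unlikely under that model. The data, the function name `is_anomalous`, and the 3-sigma cut-off are all hypothetical choices for the example, not part of any cited method.

```python
from statistics import NormalDist

# Hypothetical training set: response times (ms) known to be normal.
normal_training = [101.0, 98.5, 102.3, 99.7, 100.4, 97.9, 103.1, 100.8]

# Model of normal behavior: a Gaussian fitted to the normal data only.
model = NormalDist.from_samples(normal_training)

def is_anomalous(x, model, z_cutoff=3.0):
    """Flag x when it lies more than z_cutoff standard deviations from the mean.

    For a Gaussian this is the same as thresholding the density model.pdf(x).
    """
    z = abs(x - model.mean) / model.stdev
    return z > z_cutoff

for value in (100.0, 150.0):
    print(value, "anomalous" if is_anomalous(value, model) else "normal")
```

A single Gaussian is the simplest possible model of normal behavior; real traffic data would usually need a richer model, but the train-on-normal, score-by-likelihood pattern stays the same.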
