Interpretable Anomaly Detection with DIFFI: Depth-based Isolation Forest Feature Importance

Anomaly Detection is an unsupervised learning task aimed at detecting anomalous behaviours with respect to historical data. Multivariate Anomaly Detection in particular plays an important role in many applications, for two reasons: it can summarize the status of a complex system or observed phenomenon with a single indicator (typically called the "Anomaly Score"), and, being unsupervised, it does not require human tagging. The Isolation Forest is one of the most commonly adopted algorithms in the field of Anomaly Detection, due to its proven effectiveness and low computational complexity. A major problem affecting the Isolation Forest is its lack of interpretability, a consequence of the inherent randomness governing the splits performed by the Isolation Trees, the building blocks of the Isolation Forest. In this paper we propose effective, yet computationally inexpensive, methods to define feature importance scores at both the global and the local level for the Isolation Forest. Moreover, we define a procedure to perform unsupervised feature selection for Anomaly Detection problems based on our interpretability method; this procedure also serves the purpose of tackling the challenging task of feature importance evaluation in unsupervised anomaly detection. We assess the performance on several synthetic and real-world datasets, including comparisons against state-of-the-art interpretability techniques, and make the code publicly available to enhance reproducibility and foster research in the field.
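To make the "random splits" and "Anomaly Score" mentioned in the abstract concrete, the following is a minimal pure-Python sketch of a plain Isolation Forest. It is not the DIFFI method, which the abstract does not describe in detail, and it is not the authors' code. The function names (`c_factor`, `build_tree`, `path_length`, `fit_forest`, `anomaly_score`) and the toy data are assumptions made for illustration; the score follows the standard Isolation Forest formulation, s(x) = 2^(-E[h(x)] / c(n)), where h(x) is the depth at which x is isolated and c(n) is a normalizing constant.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # used to approximate harmonic numbers


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Grow one isolation tree: random feature, random threshold, until the depth limit."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    values = [p[feature] for p in points]
    lo, hi = min(values), max(values)
    if lo == hi:  # simplification: stop if the chosen feature is constant here
        return {"size": len(points)}
    threshold = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < threshold]
    right = [p for p in points if p[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(node, x, depth=0):
    """Depth at which x lands in a leaf, adjusted for points left unsplit in that leaf."""
    if "size" in node:
        return depth + c_factor(node["size"])
    go_left = x[node["feature"]] < node["threshold"]
    return path_length(node["left"] if go_left else node["right"], x, depth + 1)


def fit_forest(data, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample of the data."""
    rng = random.Random(seed)
    psi = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(data, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, psi


def anomaly_score(trees, psi, x):
    """Score in (0, 1]; values closer to 1 mean x is isolated after fewer splits."""
    mean_depth = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(psi))


# Toy data (hypothetical): a 2-D Gaussian cloud plus one far-away point.
data_rng = random.Random(42)
inliers = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (10.0, 10.0)

trees, psi = fit_forest(inliers + [outlier])
inlier_scores = [anomaly_score(trees, psi, p) for p in inliers]
outlier_score = anomaly_score(trees, psi, outlier)
print("outlier score:", round(outlier_score, 3))
print("highest inlier score:", round(max(inlier_scores), 3))
```

Because every split picks its feature and threshold at random, the trees themselves say little about which features made a point anomalous; this is the interpretability gap the abstract refers to.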


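Related content

Anomaly detection

In data mining, anomaly detection is the identification of items, events, or observations that do not conform to an expected pattern or to the other items in a dataset. Anomalous items typically translate into problems such as bank fraud, structural defects, medical problems, or errors in text. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions. In abuse and network intrusion detection in particular, the interesting objects are often not rare objects but unexpected bursts of activity. This pattern does not fit the usual statistical definition of an outlier as a rare object, so many anomaly detection methods (unsupervised methods in particular) will fail on such data unless it has been aggregated appropriately. A cluster analysis algorithm, by contrast, may be able to detect the micro-clusters formed by these patterns. There are three broad categories of anomaly detection methods. [1] Unsupervised anomaly detection methods detect anomalies in unlabeled test data, under the assumption that the majority of instances in the dataset are normal, by looking for the instances that fit the rest of the data least well. Supervised anomaly detection methods require a dataset whose instances have been labeled "normal" and "abnormal", and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of anomaly detection). Semi-supervised anomaly detection methods construct a model of normal behavior from a given normal training dataset, and then test the likelihood that a test instance was generated by the learned model.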
