Anomaly Detection is an unsupervised learning task aimed at detecting anomalous behaviours with respect to historical data. Multivariate Anomaly Detection in particular plays an important role in many applications, for two reasons: it summarizes the status of a complex system or observed phenomenon in a single indicator (typically called the "Anomaly Score"), and, being unsupervised, it does not require human tagging. The Isolation Forest is one of the most commonly adopted algorithms in the field of Anomaly Detection, due to its proven effectiveness and low computational complexity. A major limitation of the Isolation Forest is its lack of interpretability, a consequence of the inherent randomness governing the splits performed by the Isolation Trees, the building blocks of the Isolation Forest. In this paper we propose effective, yet computationally inexpensive, methods to define feature importance scores at both the global and local levels for the Isolation Forest. Moreover, we define a procedure to perform unsupervised feature selection for Anomaly Detection problems based on our interpretability method; this procedure also serves to tackle the challenging task of evaluating feature importance in unsupervised anomaly detection. We assess performance on several synthetic and real-world datasets, including comparisons against state-of-the-art interpretability techniques, and make the code publicly available to enhance reproducibility and foster research in the field.
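To make the role of random splits concrete, the following is a minimal sketch of the isolation idea behind Isolation Trees, not the feature importance method proposed in the paper. It assumes a simplified setting: each "tree" is a sequence of random axis-aligned splits applied to a random subsample plus the query point, and the mean number of splits needed to isolate the point stands in for an anomaly score (shorter means more anomalous). The names `path_length` and `mean_path_length`, the toy data, and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=12):
    """Count the random axis-aligned splits needed to isolate `point` in `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Pick a feature and a threshold uniformly at random: this randomness is
    # what makes the splits of an individual tree hard to interpret.
    feature = rng.randrange(len(point))
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:
        # Simplification: stop if the chosen feature cannot split the data.
        return depth
    threshold = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[feature] < threshold:
        side = [row for row in data if row[feature] < threshold]
    else:
        side = [row for row in data if row[feature] >= threshold]
    return path_length(point, side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth of `point` over many random subsamples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(point, sample + [point], rng)
    return total / n_trees


# Toy data (an assumption for illustration): a 2-D Gaussian cluster.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
inlier = (0.0, 0.0)   # sits in the middle of the cluster
outlier = (8.0, 8.0)  # far from the cluster

inlier_depth = mean_path_length(inlier, data)
outlier_depth = mean_path_length(outlier, data)
print(f"inlier mean depth:  {inlier_depth:.2f}")
print(f"outlier mean depth: {outlier_depth:.2f}")
```

In this sketch the point far from the cluster is isolated after fewer splits on average than the central point. The sketch reports only a depth per point and says nothing about which features drove the isolation, which is the interpretability gap the abstract describes.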