Identifying anomalies in multi-dimensional datasets is an important task in many real-world applications. A special case arises when anomalies are occluded in a small set of attributes, typically referred to as a subspace, rather than being visible over the entire data space. In this paper, we propose a new subspace analysis approach named Agglomerative Attribute Grouping (AAG) that addresses this challenge by searching for subspaces composed of highly correlated attributes. Such correlations represent a systematic interaction among the attributes that better reflects the behavior of normal observations, and hence can be used to improve the identification of two particularly interesting types of abnormal data samples: anomalies that are occluded in relatively small subsets of the attributes, and anomalies that represent a new data class. AAG relies on a novel multi-attribute measure, derived from information-theoretic measures of partitions, for evaluating the "information distance" between groups of data attributes. To determine the set of subspaces to use, AAG applies a variation of the well-known agglomerative clustering algorithm with the proposed multi-attribute measure as the underlying distance function. Finally, the set of subspaces is used in an ensemble for anomaly detection. Extensive evaluation demonstrates that, in the vast majority of cases, the proposed AAG method (i) outperforms classical and state-of-the-art subspace analysis methods when used in anomaly detection ensembles, and (ii) generates fewer subspaces, each with fewer attributes on average, thus resulting in a faster training time for the anomaly detection ensemble. Furthermore, in contrast to existing methods, the proposed AAG method does not require any parameter tuning.
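To make the grouping step concrete, the following is a minimal sketch of agglomerative attribute grouping on discrete data. It is an illustration under stated assumptions, not the AAG procedure itself: the normalized variation of information stands in for the proposed multi-attribute measure, which the abstract does not define, and the stopping threshold `max_distance` is a hypothetical parameter introduced only for this sketch (AAG itself is described as parameter-free). The function names `entropy`, `project`, `normalized_vi`, and `group_attributes` are likewise illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def project(data, attrs):
    """Project each record onto the given attribute indices, as a tuple."""
    return [tuple(rec[a] for a in attrs) for rec in data]


def normalized_vi(data, group_a, group_b):
    """Normalized variation of information between two attribute groups.

    Stand-in distance (an assumption, not the paper's measure):
    (2*H(A,B) - H(A) - H(B)) / H(A,B), which is 0 when each group
    determines the other and 1 when they are independent in the sample.
    """
    h_a = entropy(project(data, group_a))
    h_b = entropy(project(data, group_b))
    h_ab = entropy(project(data, group_a + group_b))
    if h_ab == 0.0:
        return 0.0
    return (2 * h_ab - h_a - h_b) / h_ab


def group_attributes(data, max_distance=0.5):
    """Greedily merge the two closest attribute groups.

    max_distance is a hypothetical stopping rule used only in this sketch.
    """
    groups = [(i,) for i in range(len(data[0]))]
    while len(groups) > 1:
        dist, a, b = min(
            (normalized_vi(data, a, b), a, b)
            for a, b in combinations(groups, 2)
        )
        if dist > max_distance:
            break
        merged = tuple(sorted(a + b))
        groups = [g for g in groups if g not in (a, b)] + [merged]
    return groups


# Toy data: attribute 1 is a function of attribute 0, attribute 3 copies
# attribute 2, and the two pairs are independent of each other.
data = [(x, 1 - x, y, y) for x in range(2) for y in range(2)]
subspaces = group_attributes(data)
print(subspaces)
```

Each resulting group would then serve as one subspace on which a member of the anomaly detection ensemble is trained; that step is omitted here because the abstract does not specify the base detector or how scores are combined.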