Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
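The three steps described above can be illustrated with a minimal NumPy sketch. It is not the released implementation. The abstract says only that tail probabilities are aggregated across dimensions, so the specific aggregation used here (summing negative log tail probabilities and taking the larger of the left-tail and right-tail sums) is an assumption made for illustration, and the function names `ecdf_tail_probs` and `outlier_scores` are hypothetical.

```python
import numpy as np


def ecdf_tail_probs(X):
    """Estimate left- and right-tail probabilities per dimension from the ECDF.

    For each column, the left-tail probability of x is the fraction of samples
    <= x, and the right-tail probability is the fraction of samples >= x.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    left = np.empty_like(X)
    right = np.empty_like(X)
    for j in range(d):
        col_sorted = np.sort(X[:, j])
        # Number of samples <= x, divided by n.
        left[:, j] = np.searchsorted(col_sorted, X[:, j], side="right") / n
        # Number of samples >= x, divided by n.
        right[:, j] = (n - np.searchsorted(col_sorted, X[:, j], side="left")) / n
    return left, right


def outlier_scores(X):
    """Aggregate tail probabilities across dimensions into one score per point.

    Assumption for this sketch: sum the negative log tail probabilities over
    dimensions and keep the larger of the left-tail and right-tail sums.
    Every point counts itself, so each probability is at least 1/n and the
    logarithm is always finite.
    """
    left, right = ecdf_tail_probs(X)
    score_left = -np.log(left).sum(axis=1)
    score_right = -np.log(right).sum(axis=1)
    return np.maximum(score_left, score_right)
```

Under these assumptions, a point that lies in the extreme tail of every dimension receives the highest score, and each per-dimension term can be inspected on its own to see which features made a point look unusual. The sketch has no hyperparameters, which mirrors the parameter-free property claimed for ECOD.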