Clustered Hierarchical Anomaly and Outlier Detection Algorithms

Anomaly and outlier detection is a long-standing problem in machine learning. In some cases, anomaly detection is easy, for example when data are drawn from a well-characterized distribution such as the Gaussian. However, when data occupy high-dimensional spaces, anomaly detection becomes more difficult. We present CLAM (Clustered Learning of Approximate Manifolds), a manifold-mapping technique that works in any metric space. CLAM begins with a fast hierarchical clustering technique and then induces a graph from the cluster tree, based on overlapping clusters selected using several geometric and topological features. Using these graphs, we implement CHAODA (Clustered Hierarchical Anomaly and Outlier Detection Algorithms), which explores various properties of the graphs and their constituent clusters to find outliers. CHAODA employs a form of transfer learning based on a training set of datasets, and applies this knowledge to a separate test set of datasets of different cardinalities, dimensionalities, and domains. On 24 publicly available datasets, we compare CHAODA (as measured by ROC AUC) to a variety of state-of-the-art unsupervised anomaly-detection algorithms. Six of the datasets are used for training. CHAODA outperforms the other approaches on 16 of the remaining 18 datasets. CLAM and CHAODA scale to large, high-dimensional "big data" anomaly-detection problems, and generalize across datasets and distance functions. Source code for CLAM and CHAODA is freely available on GitHub at https://github.com/URI-ABD/clam.
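
To make the pipeline described in the abstract concrete, the following is a minimal sketch of the general idea, not the CLAM or CHAODA implementation. It rests on several assumptions that the abstract does not state: clusters are split by assigning points to the nearer of two far-apart poles, only leaf clusters are used as graph vertices (the abstract says clusters are selected using geometric and topological features), two clusters are joined by an edge when the distance between their centers is at most the sum of their radii, and points are scored by the cardinality of their connected component. The ensemble and transfer-learning steps are omitted entirely. All names (`Cluster`, `leaf_clusters`, `anomaly_scores`, `max_size`) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    center: int    # index of the medoid point in the dataset
    radius: float  # largest distance from the center to any member
    points: list   # indices of the member points


def build_cluster(data, indices, metric):
    # Assumption: the center is the exact medoid (quadratic cost, fine for a sketch).
    center = min(indices, key=lambda i: sum(metric(data[i], data[j]) for j in indices))
    radius = max(metric(data[center], data[j]) for j in indices)
    return Cluster(center, radius, indices)


def partition(data, cluster, metric):
    # Pick two poles: the point farthest from the center, then the point farthest from it.
    left = max(cluster.points, key=lambda j: metric(data[cluster.center], data[j]))
    right = max(cluster.points, key=lambda j: metric(data[left], data[j]))
    near_left, near_right = [], []
    for j in cluster.points:
        closer_to_left = metric(data[left], data[j]) <= metric(data[right], data[j])
        (near_left if closer_to_left else near_right).append(j)
    return near_left, near_right


def leaf_clusters(data, metric, max_size):
    # Divisive hierarchical clustering; only the leaves of the tree are returned.
    stack = [build_cluster(data, list(range(len(data))), metric)]
    leaves = []
    while stack:
        cluster = stack.pop()
        if len(cluster.points) <= max_size or cluster.radius == 0:
            leaves.append(cluster)
            continue
        stack.extend(build_cluster(data, part, metric)
                     for part in partition(data, cluster, metric))
    return leaves


def anomaly_scores(data, metric=math.dist, max_size=5):
    leaves = leaf_clusters(data, metric, max_size)
    # Induce a graph: clusters are adjacent when their balls overlap.
    adjacency = {a: [] for a in range(len(leaves))}
    for a in range(len(leaves)):
        for b in range(a + 1, len(leaves)):
            gap = metric(data[leaves[a].center], data[leaves[b].center])
            if gap <= leaves[a].radius + leaves[b].radius:
                adjacency[a].append(b)
                adjacency[b].append(a)
    # Score each point by the share of the data outside its connected component.
    scores = [0.0] * len(data)
    seen = set()
    for start in range(len(leaves)):
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:
            vertex = stack.pop()
            component.append(vertex)
            for neighbor in adjacency[vertex]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        size = sum(len(leaves[v].points) for v in component)
        for v in component:
            for i in leaves[v].points:
                scores[i] = 1.0 - size / len(data)
    return scores
```

Because the sketch only ever calls `metric(x, y)`, any distance function can be passed in place of `math.dist`, which mirrors the abstract's claim that the approach applies in any metric space. Scores close to 1 mark points in small, isolated components; a real detector would combine several such graph-based scores rather than rely on one.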
