There is no shortage of outlier detection (OD) algorithms in the literature, yet the vast majority are designed for a single machine. As more datasets already reside in the cloud, the need for distributed OD techniques grows. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: we design Sparx, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we implement in Apache Spark. Through extensive experiments on three real-world datasets, with several billion points and millions of features, we show that existing open-source solutions fail to scale up, either to a large number of points or to high dimensionality, whereas Sparx yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source Sparx under the Apache license at https://tinyurl.com/sparx2022.