The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take $O(n\log n)$ work for two dimensions and sub-quadratic work for three or more dimensions, and approximate DBSCAN can be computed in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case, making them inefficient for large datasets. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with hyper-threading show that we outperform existing parallel DBSCAN implementations by up to several orders of magnitude, and achieve speedups of up to 33x over the best sequential algorithms.
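To make the quadratic worst case concrete, the following is a minimal sketch of the textbook brute-force formulation of DBSCAN, not the algorithms presented in this paper. The function name `dbscan_bruteforce` and the parameter names `eps` and `min_pts` are assumptions chosen for illustration. The all-pairs neighborhood computation is the step that costs $O(n^2)$ work.

```python
from collections import deque
from math import dist

NOISE = -1  # label for points that belong to no cluster


def dbscan_bruteforce(points, eps, min_pts):
    """Illustrative brute-force DBSCAN (hypothetical helper, not the paper's method).

    Returns a list of labels: a cluster id (0, 1, ...) or NOISE for each point.
    """
    n = len(points)
    # Quadratic step: compare every pair of points to build eps-neighborhoods.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_pts points (itself included) within eps.
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [NOISE] * n
    cluster = 0
    for start in range(n):
        # Only an unlabeled core point can seed a new cluster.
        if not is_core[start] or labels[start] != NOISE:
            continue
        labels[start] = cluster
        queue = deque([start])  # holds core points only
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    # Border points are labeled but never expanded.
                    if is_core[q]:
                        queue.append(q)
        cluster += 1
    return labels
```

In this sketch a border point reachable from several clusters is assigned to whichever cluster reaches it first, which is one common convention. Work-efficient approaches avoid the all-pairs comparison by using spatial structure instead.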