This paper presents new parallel algorithms for generating Euclidean minimum spanning trees (EMST) and spatial clustering hierarchies (known as HDBSCAN$^*$). Our approach is based on generating a well-separated pair decomposition, followed by Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We introduce a new notion of well-separation to reduce the work and space of our algorithm for HDBSCAN$^*$. We also present a parallel approximate algorithm for OPTICS based on a recent sequential algorithm by Gan and Tao. Finally, we give a new parallel divide-and-conquer algorithm for computing the dendrogram and reachability plots, which are used to visualize the clusters at different scales that arise in both EMST and HDBSCAN$^*$. We show that our algorithms are theoretically efficient: they have work (number of operations) matching that of their sequential counterparts, and polylogarithmic depth (parallel time). We implement our algorithms and propose a memory optimization that requires only a subset of the well-separated pairs to be computed and materialized, leading to savings in both space (up to 10x) and time (up to 8x). Our experiments on large real-world and synthetic data sets using a 48-core machine show that our fastest algorithms outperform the best serial algorithms for these problems by 11.13--55.89x, and existing parallel algorithms by at least an order of magnitude.
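To make the EMST problem concrete, the following minimal sequential sketch runs Kruskal's algorithm with a union-find structure over all $O(n^2)$ pairwise edges. It is a brute-force baseline for illustration only, not the algorithm of this paper: the well-separated pair decomposition and bichromatic closest pair computations described above are replaced here by exhaustive edge enumeration, and the names `find` and `emst_bruteforce` are hypothetical.

```python
import itertools
import math


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def emst_bruteforce(points):
    """Return EMST edges as (weight, i, j), using Kruskal on all point pairs.

    Assumption: every pair is a candidate edge. A WSPD-based method would
    instead generate far fewer candidate edges.
    """
    n = len(points)
    # Enumerate and sort all pairwise Euclidean distances.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    parent = list(range(n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            # The edge joins two components, so it belongs to the tree.
            parent[ri] = rj
            tree.append((w, i, j))
            if len(tree) == n - 1:
                break
    return tree


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    for w, i, j in emst_bruteforce(pts):
        print(i, j, round(w, 3))
```

This baseline takes quadratic work and space in the number of points, which is the cost that restricting the candidate edges is meant to avoid.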