Obtaining scalable algorithms for hierarchical agglomerative clustering (HAC) is of significant interest due to the massive size of real-world datasets. At the same time, efficiently parallelizing HAC is difficult due to the seemingly sequential nature of the algorithm. In this paper, we address this issue and present ParHAC, the first efficient parallel HAC algorithm with sublinear depth for the widely-used average-linkage function. In particular, we provide a $(1+\epsilon)$-approximation algorithm for this problem on graphs with $m$ edges using $\tilde{O}(m)$ work and poly-logarithmic depth. Moreover, we show that obtaining similar bounds for exact average-linkage HAC is not possible under standard complexity-theoretic assumptions. We complement our theoretical results with a comprehensive study of the ParHAC algorithm in terms of its scalability, performance, and quality, and compare with several state-of-the-art sequential and parallel baselines. On a broad set of large publicly-available real-world datasets, we find that ParHAC obtains a 50.1x speedup on average over the best sequential baseline, while achieving quality similar to the exact HAC algorithm. We also show that ParHAC can cluster one of the largest publicly available graph datasets, with 124 billion edges, in a little over three hours using a commodity multicore machine.
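To make the problem setting concrete, the following is a minimal sequential sketch of graph-based average-linkage HAC, i.e., the exact baseline that ParHAC approximates; it is not the parallel algorithm from the paper, and the graph representation and function name are illustrative assumptions. At each step it merges the pair of connected clusters maximizing the average linkage, defined as the total inter-cluster edge weight divided by the product of the two cluster sizes.

```python
def average_linkage_hac(n, edges):
    """Sequential graph-based average-linkage HAC (illustrative sketch).

    n      -- number of initial singleton clusters, labeled 0..n-1
    edges  -- list of weighted edges (u, v, w)

    Repeatedly merges the pair of clusters (A, B) maximizing
    w(A, B) / (|A| * |B|), where w(A, B) is the total edge weight
    between them. Returns the sequence of merges performed.
    """
    size = {i: 1 for i in range(n)}      # cluster sizes
    adj = {i: {} for i in range(n)}      # adj[a][b] = total weight between a and b
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    merges = []
    while True:
        # Find the connected cluster pair with the highest average linkage.
        best = max(
            ((a, b, w / (size[a] * size[b]))
             for a, nbrs in adj.items() for b, w in nbrs.items() if a < b),
            key=lambda t: t[2],
            default=None,
        )
        if best is None:           # no inter-cluster edges remain
            break
        a, b, _ = best
        merges.append((a, b))
        # Merge cluster b into cluster a, combining neighbor weights.
        for c, w in adj[b].items():
            if c == a:
                continue
            adj[a][c] = adj[a].get(c, 0.0) + w
            adj[c][a] = adj[c].get(a, 0.0) + w
            del adj[c][b]
        del adj[a][b]
        del adj[b]
        size[a] += size.pop(b)
    return merges
```

On a small example with two heavy edges and one light bridge, the heavy pairs are merged first and the bridge last; the quadratic pair search makes this suitable only as a reference implementation, which is precisely the scalability gap the paper's near-linear-work parallel algorithm addresses.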