Improved Hierarchical Clustering on Massive Datasets with Broad Guarantees

Hierarchical clustering is a stronger extension of one of today's most influential unsupervised learning methods: clustering. The goal of this method is to create a hierarchy of clusters, thereby constructing the clusters' evolutionary history and simultaneously finding clusterings at all resolutions. We propose four traits of interest for hierarchical clustering algorithms: (1) empirical performance, (2) theoretical guarantees, (3) cluster balance, and (4) scalability. While a number of algorithms are designed to achieve one or two of these traits at a time, none achieves all four. Inspired by Bateni et al.'s scalable and empirically successful Affinity Clustering [NeurIPS 2017], we introduce its successor, Matching Affinity Clustering. Like its predecessor, Matching Affinity Clustering maintains strong empirical performance and uses Massively Parallel Communication (MPC) as its distributed model. Our algorithm is designed to maintain provably balanced clusters, and we show that it achieves good constant-factor approximations for Moseley and Wang's revenue and Cohen-Addad et al.'s value. We also show that Affinity Clustering cannot approximate either function. Along the way, we introduce an efficient $k$-sized maximum matching algorithm in the MPC model.
