The applicability of agglomerative clustering, for inferring both hierarchical and flat clusterings, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition, but also provide a 2-approximation to the non-parametric DP-Means objective. This introduces a novel application of hierarchical clustering as an approximation algorithm for this non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters shows that our method finds clusters of better quality than the current state-of-the-art.
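For context, a minimal sketch of the non-parametric DP-Means objective referenced above, as it is commonly stated (the notation here is illustrative and not taken from this paper): given points $x_1, \dots, x_n$, one jointly chooses the number of clusters $k$, centers $\mu_1, \dots, \mu_k$, and assignments $z_i \in \{1, \dots, k\}$ to minimize
\[
\sum_{i=1}^{n} \lVert x_i - \mu_{z_i} \rVert^2 \;+\; \lambda k,
\]
where $\lambda > 0$ penalizes opening additional clusters. A 2-approximation in this sense is a clustering whose objective value is at most twice that of the optimal solution.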