Hierarchical clustering studies a recursive partition of a data set into clusters of successively smaller size, and is a fundamental problem in data analysis. In this work we study the cost function for hierarchical clustering introduced by Dasgupta, and present two polynomial-time approximation algorithms. Our first result is an $O(1)$-approximation algorithm for graphs of high conductance. Our simple construction bypasses the complicated recursive routines for finding sparse cuts known in the literature. Our second and main result is an $O(1)$-approximation algorithm for a wide family of graphs that exhibit a well-defined structure of clusters. This result generalises the previous state-of-the-art, which holds only for graphs generated from stochastic models. The significance of our work is demonstrated by an empirical analysis on both synthetic and real-world data sets, on which our presented algorithm outperforms the previously proposed algorithm for graphs with a well-defined cluster structure.
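For readers unfamiliar with the objective, Dasgupta's cost of a hierarchical clustering tree $T$ on a weighted graph is $\sum_{\{u,v\} \in E} w_{uv} \cdot |\mathrm{leaves}(T[u \vee v])|$, where $T[u \vee v]$ is the subtree rooted at the lowest common ancestor of $u$ and $v$. The following minimal sketch evaluates this cost for a given tree; it is not one of the approximation algorithms described above. The nested-tuple tree encoding, the `frozenset`-keyed weight dictionary, and the name `dasgupta_cost` are assumptions made purely for illustration.

```python
from typing import Dict, FrozenSet, Hashable, Set, Tuple


def dasgupta_cost(tree, weights: Dict[FrozenSet[Hashable], float]) -> float:
    """Evaluate Dasgupta's cost of a hierarchical clustering tree.

    Illustrative encoding (an assumption, not taken from the paper):
      - a leaf is any non-tuple hashable vertex label;
      - an internal node is a tuple of child subtrees;
      - weights maps frozenset({u, v}) to the edge weight w_uv.
    """

    def visit(node) -> Tuple[float, Set[Hashable]]:
        # A leaf contributes no cost and covers only itself.
        if not isinstance(node, tuple):
            return 0.0, {node}

        total = 0.0
        child_leaves = []
        for child in node:
            cost, leaves = visit(child)
            total += cost
            child_leaves.append(leaves)

        all_leaves = set().union(*child_leaves)
        size = len(all_leaves)

        # An edge whose endpoints fall in different children has its
        # lowest common ancestor at this node, so it pays w_uv * size.
        for i in range(len(child_leaves)):
            for j in range(i + 1, len(child_leaves)):
                for u in child_leaves[i]:
                    for v in child_leaves[j]:
                        total += weights.get(frozenset((u, v)), 0.0) * size

        return total, all_leaves

    return visit(tree)[0]


# Unit-weight path graph a - b - c - d.
path = {
    frozenset(("a", "b")): 1.0,
    frozenset(("b", "c")): 1.0,
    frozenset(("c", "d")): 1.0,
}

# Cutting the middle edge first keeps two edges low in the tree.
print(dasgupta_cost((("a", "b"), ("c", "d")), path))  # 8.0
# Separating neighbours at the root makes every edge pay the full size.
print(dasgupta_cost((("a", "c"), ("b", "d")), path))  # 12.0
```

On this toy path graph, the tree that keeps adjacent vertices together achieves the lower cost, which is the behaviour the objective is designed to reward.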