Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel \emph{trellis} data structure. This combination results in an exact algorithm that scales beyond previous state of the art, from a search space with $10^{12}$ trees to $10^{15}$ trees, and an approximate algorithm that improves over baselines, even in enormous search spaces that contain more than $10^{1000}$ trees. We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.
翻译:在多个领域,等级分组是一项关键的任务。 许多方法基于超自然学, 由此产生的组群的特性在事后研究。 但是, 在若干应用中, 自然成本功能可以用来描述组群的质量。 在这些应用中, 等级分组可以被视为组合优化问题。 为此, 我们根据 A* 搜索引入了一种新的方法。 我们通过将 A* 和新颖的\ emph{ trelis} 数据结构相结合, 克服了令人望而却步的大搜索空间。 这种组合的结果就是, 精确的算法超越了以往的先进水平, 从10 12 美元树的搜索空间到10 15美元树木, 以及超过基线的近似算法。 即便在含有10 10 000美元以上树的巨大搜索空间, 我们从经验上证明, 我们的方法比粒子物理案例和其他组群集基准的基线质量要高得多。 我们描述我们的方法如何大大改进了A* 组合的时间和空间复杂性的理论界限 。