An Information-theoretic Perspective of Hierarchical Clustering

A combinatorial cost function for hierarchical clustering was introduced by Dasgupta \cite{dasgupta2016cost}. Cohen-Addad et al. \cite{cohen2019hierarchical} generalized it to a more general form called an admissible function. In this paper, we investigate hierarchical clustering from the \emph{information-theoretic} perspective and formulate a new objective function. We also establish the relationship between these two perspectives. On the algorithmic side, we depart from the traditional top-down and bottom-up frameworks and propose a new one that recursively stratifies the \emph{sparsest} level of a cluster tree, guided by our objective function. For practical use, the resulting cluster tree is not required to be binary. Our algorithm, called HCSE, outputs a $k$-level cluster tree and chooses $k$ automatically through a novel and interpretable mechanism that needs no hyper-parameter. Our experimental results on synthetic datasets show that HCSE has a clear advantage in finding the intrinsic number of hierarchies, and the results on real datasets show that HCSE also achieves costs competitive with those of the popular algorithms LOUVAIN and HLP.
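
For concreteness, the combinatorial cost of \cite{dasgupta2016cost} referred to above can be written as follows. This is a sketch in notation assumed here rather than taken from the text: $G=(V,E)$ is a weighted graph with edge weights $w_{ij}$, $T$ is a cluster tree whose leaves are the vertices of $G$, and $T[i \vee j]$ denotes the subtree rooted at the lowest common ancestor of leaves $i$ and $j$.
\[
  \mathrm{cost}_G(T) \;=\; \sum_{\{i,j\} \in E} w_{ij}\,\bigl|\mathrm{leaves}\bigl(T[i \vee j]\bigr)\bigr| .
\]
Under this cost, an edge separated near the root is charged for almost all of $V$, so a low-cost tree keeps heavily weighted edges together until deep in the hierarchy.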
