Clustering is a well-known and studied problem, one of its variants, called contiguity-constrained clustering, accepts as a second input a graph used to encode prior information about cluster structure by means of contiguity constraints i.e. clusters must form connected subgraphs of this graph. This paper discusses the interest of such a setting and proposes a new way to formalise it in a Bayesian setting, using results on spanning trees to compute exactly a posteriori probabilities of candidate partitions. An algorithmic solution is then investigated to find a maximum a posteriori (MAP) partition and extract a Bayesian dendrogram from it. The interest of this last tool, which is reminiscent of the classical output of a simple hierarchical clustering algorithm, is analysed. Finally, the proposed approach is demonstrated with real applications. A reference implementation of this work is available in the R package gtclust that accompanies the paper (available at http://github.com/comeetie/gtclust)
翻译:集群是一个众所周知和研究过的问题,其中一种变种,称为毗连受限制的集群,作为第二个输入,接受一个图表,用于通过毗连限制(即集群必须形成该图的连接子集)来编码先前的集束结构信息。本文讨论这种设置的兴趣,并提议一种新的方式在巴伊西亚环境中正式确定这种设置,利用横跨树木的结果来精确地计算候选分区的后生概率。然后对一种算法解决办法进行调查,以找到一个最大的后生(MAP)分区,并从中提取一个巴伊西亚德罗格拉姆。对最后一个工具的兴趣进行了分析,因为它是简单分层集算法经典输出的同义。最后,提议的方法以实际应用方式加以示范。这项工作的参考实施情况见于与该文件相配套的R包(http://github.com/comeetie/gtlustlt)中(http://github.com/comeetrieti/gtlult)。</s>