Spectral clustering views the similarity matrix as a weighted graph and partitions the data by minimizing a graph-cut loss. Because it minimizes the across-cluster similarity, there is no need to model the distribution within each cluster. This reduces the risk of model misspecification, a common concern in mixture model-based clustering. Compared to the latter, however, spectral clustering offers no direct way to quantify clustering uncertainty (such as the assignment probability) or to accommodate model extensions for complicated data applications. To fill this gap, we propose the Bayesian forest model as a generative graphical model for spectral clustering. This is motivated by our discovery that the posterior connecting matrix in a forest model has almost the same leading eigenvectors as the ones used by normalized spectral clustering. To construct priors, we develop a "forest process" as a graph extension to the urn process, and we carefully characterize the differences in the partition probability. We derive a simple Markov chain Monte Carlo algorithm for posterior estimation and demonstrate superior performance compared to existing algorithms. We illustrate several model-based extensions useful for data applications, including high-dimensional and multi-view clustering for images.
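For readers unfamiliar with the baseline, the following is a minimal sketch of standard two-way normalized spectral clustering, the method whose leading eigenvectors the abstract refers to. It is not the Bayesian forest model or its MCMC sampler. The function name `two_way_spectral_split`, the use of power iteration with deflation, and the toy similarity matrix are illustrative assumptions, and the sketch assumes a symmetric similarity matrix with positive node degrees.

```python
import math


def two_way_spectral_split(A, iters=500):
    """Split nodes into two groups by the sign of the second leading
    eigenvector of the normalized similarity D^{-1/2} A D^{-1/2}.

    Hypothetical helper for illustration; A is a symmetric list-of-lists
    similarity matrix whose row sums (degrees) are all positive.
    """
    n = len(A)
    deg = [sum(row) for row in A]
    # Normalized similarity matrix; its eigenvalues lie in [-1, 1].
    M = [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # The leading eigenvector is proportional to sqrt(deg), eigenvalue 1.
    total = math.sqrt(sum(deg))
    v1 = [math.sqrt(d) / total for d in deg]

    # Power iteration on (M + I) / 2, which shifts the spectrum into [0, 1],
    # while projecting out v1 so the iterate converges to the second
    # leading eigenvector. The start vector is an arbitrary fixed choice.
    x = [float(i) for i in range(n)]
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(x, v1))
        x = [a - dot * b for a, b in zip(x, v1)]
        y = [0.5 * (sum(M[i][j] * x[j] for j in range(n)) + x[i])
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]

    # Rescale by D^{-1/2} and threshold at zero to obtain the two clusters.
    u = [x[i] / math.sqrt(deg[i]) for i in range(n)]
    return [0 if val < 0 else 1 for val in u]


# Toy example: two tight groups of three nodes, weakly linked to each other.
similarity = [[0.0 if i == j else (1.0 if (i < 3) == (j < 3) else 0.05)
               for j in range(6)] for i in range(6)]
labels = two_way_spectral_split(similarity)
```

In this toy example the first three nodes receive one label and the last three the other. The output is a hard partition with no assignment probabilities, which is the limitation the abstract sets out to address.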