Bayesian clustering typically relies on mixture models, with each component interpreted as a different cluster. After defining a prior for the component parameters and weights, Markov chain Monte Carlo (MCMC) algorithms are commonly used to produce samples from the posterior distribution of the component labels. The data are then clustered by minimizing the expectation of a clustering loss function that favours similarity to the component labels. Unfortunately, although these approaches are routinely implemented, clustering results are highly sensitive to kernel misspecification. For example, if Gaussian kernels are used but the true density of data within a cluster is even slightly non-Gaussian, then clusters will be broken into multiple Gaussian components. To address this problem, we develop Fusing of Localized Densities (FOLD), a novel clustering method that melds components together using the posterior of the kernels. FOLD has a fully Bayesian decision theoretic justification, naturally leads to uncertainty quantification, can be easily implemented as an add-on to MCMC algorithms for mixtures, and favours a small number of distinct clusters. We provide theoretical support for FOLD including clustering optimality under kernel misspecification. In simulated experiments and real data, FOLD outperforms competitors by minimizing the number of clusters while inferring meaningful group structure.
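The idea of melding components through the posterior of the kernels can be illustrated with a small sketch. The abstract does not state which distance between kernels, which loss, or which fusing rule FOLD uses, so everything below is an assumption made for illustration: univariate Gaussian kernels, the Hellinger distance between them, a posterior average of pairwise distances, and a simple threshold cut. The function names (`hellinger_gauss`, `mean_pairwise_distance`, `fuse`) and the toy draws are hypothetical and are not taken from the paper.

```python
import math


def hellinger_gauss(mu1, s1, mu2, s2):
    """Hellinger distance between N(mu1, s1^2) and N(mu2, s2^2) (closed form)."""
    var_sum = s1 * s1 + s2 * s2
    # Bhattacharyya coefficient of two univariate Gaussians.
    bc = math.sqrt(2.0 * s1 * s2 / var_sum) * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))
    return math.sqrt(max(0.0, 1.0 - bc))


def mean_pairwise_distance(draws):
    """Average, over MCMC draws, the distance between the kernels of each pair.

    draws[t][i] is the (mean, sd) of the kernel that observation i is
    assigned to in draw t (an assumed input format).
    """
    n = len(draws[0])
    dist = [[0.0] * n for _ in range(n)]
    for draw in draws:
        for i in range(n):
            for j in range(i + 1, n):
                d = hellinger_gauss(*draw[i], *draw[j])
                dist[i][j] += d / len(draws)
                dist[j][i] = dist[i][j]
    return dist


def fuse(dist, threshold):
    """Assumed fusing rule: link observations whose mean distance is below threshold."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                parent[find(j)] = find(i)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy posterior draws: observations 0-2 sit in overlapping Gaussian
# components (as happens when one non-Gaussian cluster is split), while
# observations 3-4 sit in a well-separated component.
draws = [
    [(0.0, 1.0), (0.4, 1.2), (0.0, 1.0), (10.0, 1.0), (10.0, 1.0)],
    [(0.1, 1.0), (0.1, 1.0), (0.5, 1.1), (9.8, 1.0), (9.8, 1.0)],
]
labels = fuse(mean_pairwise_distance(draws), threshold=0.5)
print(labels)
```

In this toy setting the overlapping components are merged into one cluster even though the sampled component labels of observations 0 to 2 differ within a draw, which is the behaviour the abstract attributes to fusing on kernels rather than on labels. The threshold of 0.5 is arbitrary; a decision-theoretic method would instead choose the clustering by minimizing an expected loss.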