Clustering is an important task in many fields, including medicine and epidemiology, genomics, environmental science, economics, and visual sciences. Methods for inferring the number of clusters have often been proven inconsistent, and introducing a dependence structure among the clusters makes estimation harder still. In a Bayesian setting, clustering is performed by treating the unknown partition as a random object and defining a prior distribution on it. This prior may be induced by a model on the observations or defined directly on the partition. Several recent results, however, have shown how difficult it is to estimate the number of clusters, and therefore the partition, consistently. Summarising the posterior distribution on the partition is itself an open problem, given how large the partition space is. This work reviews the Bayesian approaches to clustering available in the literature and presents the advantages and disadvantages of each in order to suggest future lines of research.