Cluster analysis aims to partition data into groups, or clusters. In applications, the number of clusters is often unknown. Bayesian mixture models employed in such applications usually specify a flexible prior that accounts for the uncertainty about the number of clusters. However, a major empirical challenge in using these models is characterising the prior they induce on the partitions. This work introduces an approach to compute descriptive statistics of the prior on the partitions for three selected Bayesian mixture models developed in the areas of Bayesian finite mixtures and Bayesian nonparametrics. The proposed methodology involves computationally efficient enumeration of the prior on the number of clusters in the sample (termed ``data clusters'') and determination of the first two prior moments of symmetric additive statistics characterising the partitions. The accompanying reference implementation is available in the R package 'fipp'. Finally, we illustrate the proposed methodology through comparisons and discuss the implications for prior elicitation in applications.
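To make the notion of a prior on the number of data clusters concrete, the following is a minimal sketch in Python for one familiar case: a Dirichlet process mixture with concentration parameter alpha. That choice is an assumption made for illustration and is not stated in the text above; the function names are hypothetical, and the sketch neither uses nor reproduces the 'fipp' package. It builds the distribution by the standard Chinese restaurant process recursion, in which the observation arriving after i others opens a new cluster with probability alpha / (i + alpha).

```python
def dp_prior_num_clusters(n, alpha):
    """Prior pmf of the number of data clusters among n observations.

    Assumes a Dirichlet process prior with concentration alpha (illustrative
    assumption). Returns a list p of length n + 1 with p[k] = P(K+ = k).
    """
    if n < 1 or alpha <= 0:
        raise ValueError("need n >= 1 and alpha > 0")
    probs = [0.0, 1.0]  # one observation always forms exactly one cluster
    for i in range(1, n):
        # i observations are already clustered; the next one either joins
        # an existing cluster or opens a new one.
        p_new = alpha / (i + alpha)
        nxt = [0.0] * (i + 2)
        for k in range(1, i + 1):
            nxt[k] += probs[k] * (1.0 - p_new)  # joins an existing cluster
            nxt[k + 1] += probs[k] * p_new      # opens a new cluster
        probs = nxt
    return probs


def mean_and_variance(pmf):
    """First two prior moments, summarised as mean and variance of the count."""
    mean = sum(k * p for k, p in enumerate(pmf))
    second = sum(k * k * p for k, p in enumerate(pmf))
    return mean, second - mean * mean


if __name__ == "__main__":
    pmf = dp_prior_num_clusters(n=100, alpha=1.0)
    print(mean_and_variance(pmf))
```

The recursion costs O(n^2) operations and avoids evaluating Stirling numbers directly. Varying alpha and inspecting the resulting mean and variance is one simple way to see how a hyperparameter choice translates into prior beliefs about the number of data clusters.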