We consider Bayesian mixture of finite mixtures (MFM) and Dirichlet process mixture (DPM) models for clustering. Recent asymptotic theory has established that DPMs overestimate the number of clusters for large samples and that estimators from both classes of models are inconsistent for the number of clusters under misspecification, but the implications for finite-sample analyses are unclear. The final reported estimate after fitting these models is often a single representative clustering obtained using an MCMC summarisation technique, but it is unknown how well such a summary estimates the number of clusters. Here we investigate these practical considerations through simulations and an application to gene expression data, and find that (i) DPMs overestimate the number of clusters even in finite samples, but only to a limited degree that may be correctable using appropriate summaries, and (ii) misspecification can lead to considerable overestimation of the number of clusters in both DPMs and MFMs, but results are nevertheless often still interpretable. We provide recommendations on MCMC summarisation and suggest that, although the more appealing asymptotic properties of MFMs provide strong motivation to prefer them, results obtained using MFMs and DPMs are often very similar in practice.