Bayesian mixture models are widely used for clustering high-dimensional data with appropriate uncertainty quantification. However, as the dimension of the observations increases, posterior inference often tends to favor too many or too few clusters. This article explains this behavior by studying the random partition posterior in a non-standard setting with a fixed sample size and increasing data dimensionality. We provide conditions under which the finite-sample posterior tends to either assign every observation to a different cluster or all observations to the same cluster as the dimension grows. Interestingly, the conditions do not depend on the choice of clustering prior, as long as all possible partitions of the observations into clusters have positive prior probability, and they hold irrespective of the true data-generating model. We then propose a class of latent mixtures for Bayesian clustering (Lamb) built on a set of low-dimensional latent variables that induce a partition of the observed data. The model is amenable to scalable posterior inference, and we show that it can avoid the pitfalls of high dimensionality under mild assumptions. The proposed approach is shown to perform well in simulation studies and in an application to inferring cell types from scRNA-seq data.
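To make the idea of a latent mixture inducing a partition on high-dimensional observations concrete, the following is a minimal generative sketch of the kind of model the abstract describes. It assumes a Gaussian linear factor link and a finite Gaussian mixture on the latent variables; the dimensions, the number of clusters, and names such as `Lambda` and `eta` are illustrative assumptions, not details taken from the article.

```python
# Illustrative sketch (assumed specifics): clustering happens on low-dimensional
# latent variables eta_i, which are mapped to high-dimensional observations y_i
# through a linear factor model with additive noise.
import numpy as np

rng = np.random.default_rng(0)

n, p, d, K = 100, 1000, 5, 3              # samples, observed dim, latent dim, clusters
weights = rng.dirichlet(np.ones(K))       # mixture weights over latent clusters
mu = rng.normal(scale=3.0, size=(K, d))   # latent cluster means

# Cluster labels and low-dimensional latent variables eta_i.
z = rng.choice(K, size=n, p=weights)
eta = mu[z] + rng.normal(size=(n, d))

# High-dimensional observations via y_i = Lambda @ eta_i + eps_i,
# so the partition of the data lives entirely in the latent space.
Lambda = rng.normal(size=(p, d))
y = eta @ Lambda.T + 0.5 * rng.normal(size=(n, p))
```

Under this kind of specification, posterior inference on the clustering can target the low-dimensional latent variables rather than the p-dimensional observations, which is what makes scalable inference and escape from the high-dimensional degeneracy plausible.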