Bayesian mixture models are widely used for clustering high-dimensional data with appropriate uncertainty quantification, but as the dimension increases, posterior inference tends to favor too many or too few clusters. This article explains this behavior by studying the random partition posterior in a non-standard setting with a fixed sample size and increasing data dimensionality. We give conditions under which the finite-sample posterior tends to assign either every observation to a different cluster or all observations to the same cluster as the dimension grows. Interestingly, the conditions do not depend on the choice of clustering prior, as long as all possible partitions of observations into clusters have positive prior probability, and they hold irrespective of the true data-generating model. We then propose a class of latent mixtures for Bayesian clustering (Lamb), defined on a set of low-dimensional latent variables that induce a partition of the observed data. The model is amenable to scalable posterior inference, and we show that it avoids the pitfalls of high dimensionality under mild assumptions. The proposed approach performs well in simulation studies and in an application to inferring cell types from scRNAseq data.