In many applications, there is interest in clustering very high-dimensional data. A common strategy is first-stage dimensionality reduction followed by a standard clustering algorithm, such as k-means. This approach does not target the dimension reduction to the clustering objective, and it fails to quantify uncertainty. Model-based Bayesian approaches provide an appealing alternative, but often have poor performance in high dimensions, producing too many or too few clusters. This article provides an explanation for this behavior by studying the clustering posterior in a non-standard setting with fixed sample size and increasing dimensionality. We show that, as dimension grows, the finite-sample posterior tends to assign either every observation to a different cluster or all observations to the same cluster, depending on the kernels and prior specification but not on the true data-generating model. To find models avoiding this pitfall, we define a Bayesian oracle for clustering, with the oracle clustering posterior based on the true values of low-dimensional latent variables. We define a class of LAtent Mixtures for Bayesian (Lamb) clustering models that have behavior equivalent to this oracle as dimension grows. Lamb is shown to have good performance in simulation studies and in an application to inferring cell types based on scRNAseq.
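For concreteness, below is a minimal sketch of the two-stage baseline the abstract critiques: unsupervised dimension reduction followed by k-means on the reduced coordinates. It assumes scikit-learn; the synthetic data and the choices of 10 components and 5 clusters are illustrative, not drawn from the paper.

```python
# Two-stage baseline: PCA for dimension reduction, then k-means.
# Illustrative sketch only; data and hyperparameters are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))  # n = 100 observations, p = 5000 dimensions

Z = PCA(n_components=10).fit_transform(X)               # stage 1: reduce p to 10
labels = KMeans(n_clusters=5, n_init=10).fit_predict(Z)  # stage 2: cluster in R^10

# The reduction is not targeted to the clustering objective, and the
# output is a single hard partition: no posterior uncertainty over labels.
print(labels[:10])
```

Note that the pipeline returns one point-estimate partition; the Bayesian approaches discussed in the abstract instead yield a posterior over partitions, which is what motivates studying its high-dimensional behavior.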