We study the sparse high-dimensional Gaussian mixture model when the number of clusters is allowed to grow with the sample size. We establish a minimax lower bound for parameter estimation and show that a constrained maximum likelihood estimator attains it. However, this optimization-based estimator is computationally intractable because the objective function is highly nonconvex and the feasible set involves discrete structures. To address the computational challenge, we propose a Bayesian approach that uses a continuous spike-and-slab prior to estimate high-dimensional Gaussian mixtures whose cluster centers are sparse. Posterior inference can be carried out efficiently with an easy-to-implement Gibbs sampler. We further prove that the posterior contraction rate of the proposed Bayesian method is minimax optimal. The mis-clustering rate is obtained as a by-product using tools from matrix perturbation theory. The proposed Bayesian sparse Gaussian mixture model does not require pre-specifying the number of clusters, which can be estimated adaptively via the Gibbs sampler. The validity and usefulness of the proposed method are demonstrated through simulation studies and the analysis of a real-world single-cell RNA sequencing dataset.
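To make the notion of a continuous spike-and-slab prior concrete, the following is a minimal sketch and not the prior or sampler of the paper. It assumes that each coordinate of a cluster center is drawn from a two-component Gaussian mixture: a narrow "spike" concentrated near zero and a wide "slab". The function names and the hyperparameters `theta`, `spike_sd`, and `slab_sd` are hypothetical choices made for illustration. The second function computes the conditional probability that a coordinate belongs to the slab, which is the kind of indicator update a Gibbs sampler for such a prior would typically draw from.

```python
import numpy as np

def sample_spike_and_slab(p, theta, spike_sd, slab_sd, rng):
    """Draw a p-vector from an (assumed) Gaussian spike-and-slab prior.

    Each coordinate is N(0, slab_sd^2) with probability theta and
    N(0, spike_sd^2) otherwise, so most coordinates are close to zero
    when theta is small and spike_sd is tiny.
    """
    is_slab = rng.random(p) < theta              # latent indicators, True = slab
    sd = np.where(is_slab, slab_sd, spike_sd)    # per-coordinate standard deviation
    return rng.normal(0.0, sd), is_slab

def slab_probability(x, theta, spike_sd, slab_sd):
    """Conditional probability that each coordinate of x came from the slab.

    Computed from log densities to avoid dividing two underflowed numbers.
    """
    x = np.asarray(x, dtype=float)
    log_slab = np.log(theta) - np.log(slab_sd) - 0.5 * (x / slab_sd) ** 2
    log_spike = np.log1p(-theta) - np.log(spike_sd) - 0.5 * (x / spike_sd) ** 2
    return 1.0 / (1.0 + np.exp(log_spike - log_slab))

rng = np.random.default_rng(0)
center, indicators = sample_spike_and_slab(
    p=1000, theta=0.05, spike_sd=0.01, slab_sd=1.0, rng=rng
)
probs = slab_probability(center, theta=0.05, spike_sd=0.01, slab_sd=1.0)
```

Because both components are continuous, the indicator probabilities have a closed form, which is what makes this family of priors convenient for Gibbs sampling compared with a point mass at zero.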