Clustering high-dimensional data remains a challenging task for modern statistics and machine learning, with a wide range of applications. In this work, we consider the discriminative latent mixture model and extend it to the Bayesian framework. The data are modeled as a mixture of Gaussians in a low-dimensional discriminative subspace; we introduce a Gaussian prior distribution over the latent group means and derive a family of twelve submodels corresponding to different covariance structures. Model inference is carried out with a variational EM algorithm, while the discriminative subspace is estimated via a Fisher step that maximizes an unsupervised Fisher criterion. An empirical Bayes procedure is proposed for estimating the prior hyper-parameters, and an integrated classification likelihood criterion is derived for selecting both the number of clusters and the submodel. The performance of the resulting Bayesian Fisher-EM algorithm is investigated in two simulated scenarios, one varying the dimensionality and one varying the noise, showing its superiority over state-of-the-art Gaussian subspace clustering models. In addition to standard real-data benchmarks, an application to single-image denoising is proposed, with relevant results. A reference implementation in R is provided in the FisherEM package accompanying the paper.
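To make the Fisher step more concrete, the following is a minimal NumPy sketch of one way to maximize a soft (unsupervised) Fisher criterion, trace((UᵀSU)⁻¹ UᵀS_B U), given posterior cluster probabilities. It is an illustration under simplifying assumptions, not the FisherEM implementation: the function name `fisher_step`, its arguments, and the whitening-plus-QR strategy are all hypothetical choices made here, and the actual algorithm may enforce orthonormality differently.

```python
import numpy as np

def fisher_step(X, T, d, tol=1e-10):
    """Illustrative Fisher step (hypothetical helper, not the package API).

    X : (n, p) data matrix
    T : (n, K) soft cluster memberships, each row summing to 1
    d : dimension of the discriminative subspace
    Returns U : (p, d) matrix with orthonormal columns.
    """
    n, p = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean
    S = Xc.T @ Xc / n                       # total covariance, (p, p)

    nk = T.sum(axis=0)                      # soft cluster sizes, (K,)
    means = (T.T @ X) / nk[:, None]         # soft cluster means, (K, p)
    M = means - mean
    S_B = (M.T * nk) @ M / n                # soft between-cluster scatter, (p, p)

    # Whiten with respect to S, dropping near-null directions for stability.
    w, V = np.linalg.eigh(S)
    keep = w > tol * w.max()
    W = V[:, keep] / np.sqrt(w[keep])       # satisfies W.T @ S @ W = I

    # Leading eigenvectors of the whitened between-cluster scatter.
    _, evecs = np.linalg.eigh(W.T @ S_B @ W)
    top = evecs[:, ::-1][:, :d]             # eigh sorts eigenvalues ascending

    # Map back to the original space and orthonormalize the basis.
    U, _ = np.linalg.qr(W @ top)
    return U
```

In an EM-style loop, `T` would come from the (variational) E-step and `U` would then be used to project the data before the parameter updates; with K clusters, a natural choice in this setting is d ≤ K − 1.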