To avoid the curse of dimensionality, a common approach to clustering high-dimensional data is to first project the data into a space of reduced dimension, and then cluster the projected data. Although effective, this two-stage approach prevents joint optimization of the dimensionality-reduction and clustering models, and obscures how well the complete model describes the data. Here, we show how a family of such two-stage models can be combined into a single, hierarchical model that we call a hierarchical mixture of Gaussians (HMoG). An HMoG simultaneously captures both dimensionality reduction and clustering, and its performance is quantified in closed form by the likelihood function. By formulating and extending existing models with exponential family theory, we show how to maximize the likelihood of HMoGs with expectation-maximization. We apply HMoGs to synthetic data and RNA sequencing data, and demonstrate how they overcome the limitations of two-stage models. Ultimately, HMoGs are a rigorous generalization of a common statistical framework, and provide researchers with a method to improve model performance when clustering high-dimensional data.
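To make the two-stage baseline concrete, the following is a minimal sketch of the approach the abstract describes: project high-dimensional data with PCA, then cluster the projection with expectation-maximization for a Gaussian mixture. The synthetic setup, the spherical-covariance simplification, and all variable names are illustrative assumptions, not the paper's implementation; an HMoG would instead optimize both stages jointly under one likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (illustrative): two well-separated clusters in 50 dimensions.
centers = rng.normal(size=(2, 50)) * 5.0
X = np.vstack([c + rng.normal(size=(200, 50)) for c in centers])

# Stage 1: dimensionality reduction by PCA (top two right singular vectors).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T

# Stage 2: EM for a 2-component spherical Gaussian mixture on the projection.
mu = Z[[0, 200]].copy()   # initial means, one drawn from each half of the data
var = np.ones(2)          # per-component spherical variances
pi = np.full(2, 0.5)      # mixing weights
for _ in range(50):
    # E-step: responsibilities from the component log-densities.
    d2 = ((Z[:, None, :] - mu[None]) ** 2).sum(-1)    # squared distances to means
    logp = np.log(pi) - d2 / (2 * var) - np.log(var)  # up to an additive constant
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and variances.
    nk = r.sum(axis=0)
    pi = nk / len(Z)
    mu = (r.T @ Z) / nk[:, None]
    d2 = ((Z[:, None, :] - mu[None]) ** 2).sum(-1)
    var = (r * d2).sum(axis=0) / (2 * nk)  # divide by 2: the latent dimension

labels = r.argmax(axis=1)
```

Because the projection is fit before and independently of the mixture, the overall fit has no single likelihood to report, which is precisely the limitation the HMoG formulation removes.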