We investigate a clustering problem with data from a mixture of Gaussians that share a common but unknown, and potentially ill-conditioned, covariance matrix. We start by considering Gaussian mixtures with two equally-sized components and derive a Max-Cut integer program based on maximum likelihood estimation. We prove its solutions achieve the optimal misclassification rate when the number of samples grows linearly in the dimension, up to a logarithmic factor. However, solving the Max-cut problem appears to be computationally intractable. To overcome this, we develop an efficient spectral algorithm that attains the optimal rate but requires a quadratic sample size. Although this sample complexity is worse than that of the Max-cut problem, we conjecture that no polynomial-time method can perform better. Furthermore, we gather numerical and theoretical evidence that supports the existence of a statistical-computational gap. Finally, we generalize the Max-Cut program to a $k$-means program that handles multi-component mixtures with possibly unequal weights. It enjoys similar optimality guarantees for mixtures of distributions that satisfy a transportation-cost inequality, encompassing Gaussian and strongly log-concave distributions.
翻译:我们从一个具有共同但未知且可能条件差的高斯人混合体的数据中调查一个组群问题。 我们首先考虑高斯混合体,其中含有两个同等大小的成分,然后根据最大可能性估计得出一个最大- Cut 整数程序。 我们证明,当样本数量在尺寸上线性增长时,其解决方案达到了最佳的分类率, 直至对数系数。 但是, 解决最大截断问题似乎在计算上难以解决。 为了克服这一点, 我们开发了一个高效的光谱算法, 该算法可以达到最佳速率, 但需要一个二次曲线样本大小。 尽管这种样本复杂性比最大切割问题要差, 但我们推测, 任何多球- 时间方法都不会有更好的效果。 此外, 我们收集数字和理论证据, 支持存在统计- 计算差距。 最后, 我们将 Max- Cut 程序推广到一个 $k- poke 比例程序, 处理可能不平等的多构件混合物。 它拥有类似的最佳性保证, 可以满足运输- 成本分布的混合, 包括高戈斯和逻辑分布。