The learning of mixture models can be viewed as a clustering problem. Indeed, given data samples independently generated from a mixture of distributions, we often wish to find the {\it correct target clustering} of the samples, i.e., the clustering that groups samples according to the component distributions from which they were generated. For clustering problems, practitioners often choose the simple $k$-means algorithm, which attempts to find an {\it optimal clustering} that minimizes the sum-of-squares distance between each point and its cluster center. In this paper, we consider fundamental (i.e., information-theoretic) limits of the solutions (clusterings) obtained by optimizing this sum-of-squares distance. In particular, we provide sufficient conditions for the closeness of any optimal clustering to the correct target clustering, assuming that the data samples are generated from a mixture of spherical Gaussian distributions. We also generalize our results to log-concave distributions. Moreover, we show that under similar or even weaker conditions on the mixture model, any optimal clustering of the samples with reduced dimensionality is also close to the correct target clustering. These results provide intuition for the informativeness of $k$-means (with and without dimensionality reduction) as an algorithm for learning mixture models.
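For reference, the sum-of-squares objective optimized by $k$-means can be written as follows (the notation $x_1,\dots,x_n$ for the samples, $C_1,\dots,C_k$ for the clusters, and $c_j$ for the cluster centers is introduced here for illustration and is not fixed by the paper):
\[
\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k}\sum_{x_i\in C_j}\|x_i-c_j\|^2,
\qquad
c_j=\frac{1}{|C_j|}\sum_{x_i\in C_j}x_i .
\]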