A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction $\rho$ of non-zero components of the cluster means, as well as the ratio $\alpha$ between the number of samples and the dimension, are held fixed while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm, analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. In particular, we identify a statistical-to-computational gap between the algorithmic threshold $\lambda_{\text{alg}} \ge k / \sqrt{\alpha}$, the signal-to-noise ratio required for AMP to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of AMP with other sparsity-enhancing algorithms, such as sparse PCA and diagonal thresholding.
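As a rough illustration of the setup (not part of the original abstract), the sketch below samples data from one common formulation of the sparse $k$-Gaussian mixture model and evaluates the two thresholds quoted above. The signal scaling $x = \sqrt{\lambda/d}\,\mu_c + z$, the Gauss-Bernoulli prior on the means, and the parameter values are assumptions for illustration; the paper's precise normalization may differ.

```python
import numpy as np

def sample_sparse_gmm(d=2000, alpha=2.0, k=3, rho=0.05, lam=1.0, seed=0):
    """Sample n = alpha * d points from a k-Gaussian mixture with sparse means.

    Assumed convention: each mean coordinate is non-zero with probability rho
    (Gauss-Bernoulli, variance 1/rho on the support), and
    x_i = sqrt(lam / d) * mu_{c_i} + standard Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    n = int(alpha * d)                                   # alpha = n / d
    support = rng.random((k, d)) < rho                   # sparse support of the means
    means = rng.standard_normal((k, d)) * support / np.sqrt(rho)
    labels = rng.integers(k, size=n)
    X = np.sqrt(lam / d) * means[labels] + rng.standard_normal((n, d))
    return X, labels, means

def thresholds(k, rho, alpha):
    """Thresholds quoted in the abstract (up to constant factors)."""
    lam_alg = k / np.sqrt(alpha)                         # conjectured algorithmic (AMP) threshold
    lam_it = np.sqrt(-k * rho * np.log(rho)) / np.sqrt(alpha)  # information-theoretic scaling
    return lam_alg, lam_it

X, y, mu = sample_sparse_gmm()
print(thresholds(k=3, rho=0.05, alpha=2.0))  # lam_it < lam_alg for small rho: the stat-to-comp gap
```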