We study the statistical and computational limits of clustering when the cluster centres are sparse and their dimension is possibly much larger than the sample size. Our theoretical analysis focuses on the model $X_i = z_i \theta + \varepsilon_i$, with $z_i \in \{-1,1\}$ and $\varepsilon_i \sim \mathcal{N}(0,I)$, which has two clusters with centres $\theta$ and $-\theta$. We provide a finite-sample analysis of a new sparse clustering algorithm based on sparse PCA and show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \rightarrow \infty$. Our results require the sparsity to grow more slowly than the square root of the sample size. Using a recent framework for computational lower bounds, the low-degree likelihood ratio, we give evidence that this condition is necessary for any polynomial-time clustering algorithm to succeed below the BBP threshold. This complements existing evidence based on reductions and statistical query lower bounds. Compared to these existing results, we cover a wider set of parameter regimes and give a more precise understanding of the runtime required and the misclustering error achievable. Our results imply that a large class of tests based on low-degree polynomials fail to solve even the weak testing task.
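To make the model concrete, the following is a minimal simulation sketch, not the algorithm analysed in the paper. It draws data from $X_i = z_i \theta + \varepsilon_i$ with an $s$-sparse $\theta$ and clusters it with a simple stand-in for sparse PCA: diagonal thresholding (keep the $s$ coordinates with the largest empirical second moment), followed by the sign of the projection onto the leading eigenvector. The function names, the equal-magnitude choice of $\theta$, the assumption that $s$ is known, and the parameter values are all illustrative assumptions.

```python
import numpy as np

def simulate(n, p, s, signal, rng):
    """Draw X_i = z_i * theta + eps_i, with an s-sparse theta of norm `signal`."""
    theta = np.zeros(p)
    theta[:s] = signal / np.sqrt(s)      # assumption: equal-magnitude nonzero entries
    z = rng.choice([-1, 1], size=n)      # hidden cluster labels
    X = z[:, None] * theta[None, :] + rng.standard_normal((n, p))
    return X, z, theta

def cluster_diagonal_thresholding(X, s):
    """Stand-in for sparse PCA: select s coordinates, then use the leading eigenvector."""
    second_moment = (X ** 2).mean(axis=0)        # about 1 + theta_j^2 per coordinate
    support = np.argsort(second_moment)[-s:]     # s largest second moments
    Xs = X[:, support]
    cov = Xs.T @ Xs / X.shape[0]
    _, vecs = np.linalg.eigh(cov)                # eigenvalues in ascending order
    v = vecs[:, -1]                              # leading eigenvector
    labels = np.sign(Xs @ v)
    labels[labels == 0] = 1                      # break exact ties arbitrarily
    return labels.astype(int)

def misclustering_rate(labels, z):
    """Fraction of mislabelled points, up to a global sign flip of the labels."""
    err = np.mean(labels != z)
    return min(err, 1 - err)

rng = np.random.default_rng(0)
X, z, theta = simulate(n=400, p=2000, s=5, signal=4.0, rng=rng)
labels = cluster_diagonal_thresholding(X, s=5)
rate = misclustering_rate(labels, z)
print(f"misclustering rate: {rate:.4f}")
```

Because the labels are identifiable only up to a global sign, the error is taken as the minimum over the two sign conventions. For reference, when $\theta$ is known, the rule $\operatorname{sign}(\langle X_i, \theta \rangle)$ has error $\Phi(-\|\theta\|)$, which gives a natural benchmark for the simulated rate.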