High-dimensional clustering analysis is a challenging problem in statistics and machine learning, with broad applications such as the analysis of microarray and RNA-seq data. In this paper, we propose a new clustering procedure called Spectral Clustering with Feature Selection (SC-FS). We first obtain an initial estimate of the labels via spectral clustering, then select a small fraction of features with the largest R-squared with respect to these labels, i.e., the proportion of variation explained by the group labels, and finally conduct clustering again using only the selected features. Under mild conditions, we prove that the proposed method identifies all informative features with high probability and achieves the minimax optimal clustering error rate for the sparse Gaussian mixture model. Applications of SC-FS to four real-world data sets demonstrate its usefulness in clustering high-dimensional data.
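The three-stage procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "spectral clustering" means k-means on the leading singular vectors of the centered data matrix, that the second clustering pass reuses the same spectral step, and that the number of retained features (`n_keep`) is supplied by the user. The function names (`kmeans`, `spectral_labels`, `r_squared`, `sc_fs`) and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def kmeans(Z, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-first initialisation
    (a simplification chosen for this sketch, not taken from the paper)."""
    centers = [Z[np.argmax(((Z - Z.mean(axis=0)) ** 2).sum(axis=1))]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest chosen center.
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_labels(X, k):
    """Assumed spectral step: k-means on the top-k scaled left singular vectors."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return kmeans(U[:, :k] * s[:k], k)


def r_squared(X, labels):
    """Per-feature R-squared: between-group sum of squares over total sum of squares."""
    Xc = X - X.mean(axis=0)
    total = (Xc ** 2).sum(axis=0)
    between = np.zeros(X.shape[1])
    for g in np.unique(labels):
        mask = labels == g
        between += mask.sum() * Xc[mask].mean(axis=0) ** 2
    return between / np.maximum(total, 1e-12)  # guard against constant features


def sc_fs(X, k, n_keep):
    """Initial labels -> keep the n_keep features with largest R-squared -> recluster."""
    initial = spectral_labels(X, k)
    r2 = r_squared(X, initial)
    selected = np.sort(np.argsort(r2)[-n_keep:])
    final = spectral_labels(X[:, selected], k)
    return final, selected


# Synthetic sparse two-component Gaussian mixture (illustrative values only):
# 100 samples, 200 features, of which the first 10 carry a mean shift.
rng = np.random.default_rng(0)
n, p, s = 100, 200, 10
truth = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[truth == 1, :s] += 3.0

labels, selected = sc_fs(X, k=2, n_keep=s)
agreement = np.mean(labels == truth)
accuracy = max(agreement, 1.0 - agreement)  # cluster labels are only defined up to a swap
print("selected features:", selected)
print("clustering accuracy:", accuracy)
```

In this sketch the R-squared of each feature is computed against the initial labels exactly as the proportion of variation explained by group membership, so informative features should rank well above pure-noise features whenever the initial labels are reasonably accurate. How `n_keep` should be chosen in practice is not specified here and is left as a user input.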