Sparse principal component analysis (sparse PCA) is a widely used technique for dimensionality reduction in multivariate analysis that addresses two key limitations of standard PCA. First, sparse PCA can be applied in high-dimensional, low-sample-size settings, such as genetic microarrays. Second, it improves interpretability because many loadings of each component are regularized to zero. However, over-regularization can cause the sparse singular vectors to deviate greatly from the population singular vectors, potentially misrepresenting the data structure. Additionally, sparse singular vectors are often not orthogonal, so components share information, which complicates the calculation of explained variance. To address these challenges, we propose a methodology for sparse PCA that reflects the inherent structure of the data matrix. Specifically, we identify submatrices of the data matrix that are uncorrelated with one another, which means that the covariance matrix has a sparse block-diagonal structure. Such sparse matrices commonly occur in high-dimensional settings. The singular vectors of such a data matrix are inherently sparse, which improves interpretability while capturing the underlying data structure. Furthermore, these singular vectors are orthogonal by construction, ensuring that they do not share information. We demonstrate the effectiveness of our method through simulations and provide real data applications. Supplementary materials for this article are available online.
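The claim that a block-diagonal covariance matrix yields singular vectors that are both sparse and orthogonal can be illustrated with a small numerical sketch. This is not the proposed method: it assumes the uncorrelated blocks are already known, whereas the abstract describes identifying them from the data. The function name `block_diag_eigvecs` and the toy covariance blocks are hypothetical and chosen only for illustration.

```python
import numpy as np


def block_diag_eigvecs(blocks):
    """Eigen-decompose a block-diagonal covariance one block at a time.

    Assumes `blocks` is a list of symmetric covariance blocks whose
    variables are mutually uncorrelated across blocks (hypothetical input;
    finding the blocks is the hard part and is not shown here).
    """
    p = sum(b.shape[0] for b in blocks)
    vals, vecs = [], []
    start = 0
    for b in blocks:
        k = b.shape[0]
        w, v = np.linalg.eigh(b)  # eigenpairs of this block only
        for j in range(k):
            full = np.zeros(p)
            # Embed the block eigenvector; entries outside the block stay zero.
            full[start:start + k] = v[:, j]
            vals.append(w[j])
            vecs.append(full)
        start += k
    order = np.argsort(vals)[::-1]  # sort by decreasing variance
    return np.array(vals)[order], np.column_stack(vecs)[:, order]


# Toy example (made-up numbers): two uncorrelated groups of variables.
block_a = np.array([[1.0, 0.8],
                    [0.8, 1.0]])
block_b = np.array([[1.0, 0.5, 0.5],
                    [0.5, 1.0, 0.5],
                    [0.5, 0.5, 1.0]])

eigvals, V = block_diag_eigvecs([block_a, block_b])

# Because the columns of V are orthonormal, explained variance is additive.
explained_ratio = eigvals / eigvals.sum()
```

Each column of `V` is nonzero only within a single block, so sparsity follows from the covariance structure rather than from a penalty, and `V.T @ V` equals the identity up to rounding, so the explained-variance ratios sum to one without any correction for overlap between components.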