The spiked covariance model has gained increasing popularity in high-dimensional data analysis. A fundamental problem is determining the number of spiked eigenvalues, $K$. For the estimation of $K$, most attention has focused on the top eigenvalues of the sample covariance matrix, and there has been little investigation into proper ways of using the bulk eigenvalues to estimate $K$. We propose a principled approach to incorporating bulk eigenvalues in the estimation of $K$. Our method imposes a working model on the residual covariance matrix, which is assumed to be a diagonal matrix whose entries are drawn from a gamma distribution. Under this model, the bulk eigenvalues are asymptotically close to the quantiles of a fixed parametric distribution. This motivates a two-step method: the first step uses the bulk eigenvalues to estimate the parameters of this distribution, and the second step leverages these parameters to assist the estimation of $K$. The resulting estimator $\hat{K}$ aggregates information from a large number of bulk eigenvalues. We show the consistency of $\hat{K}$ under a standard spiked covariance model. We also propose a confidence interval estimate for $K$. Our extensive simulation studies show that the proposed method is robust and outperforms existing methods in a range of scenarios. We apply the proposed method to the analysis of a lung cancer microarray data set and the 1000 Genomes data set.
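For concreteness, the working model described above can be sketched as follows. The notation is assumed here for illustration and is not given in the abstract: $p$ is the dimension, $\mu_k$ and $\xi_k$ are the spike strengths and directions, $\sigma_j^2$ are the diagonal entries of the residual covariance matrix $D$, and $a$, $b$ are the shape and rate of the gamma distribution.

```latex
\Sigma \;=\; \sum_{k=1}^{K} \mu_k \, \xi_k \xi_k^{\top} \;+\; D,
\qquad
D = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_p^2),
\qquad
\sigma_j^2 \overset{\text{iid}}{\sim} \operatorname{Gamma}(a, b).
```

In this notation, the first step of the two-step method would estimate $(a, b)$ from the bulk eigenvalues of the sample covariance matrix, and the second step would use the fitted values to assist the estimation of $K$.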