We study the extreme eigenvalues of the sample covariance matrix $Q=YY^*$ under the generalized elliptical model $Y=\Sigma^{1/2}XD.$ Here $\Sigma$ is a bounded $p \times p$ positive definite deterministic matrix representing the population covariance structure, $X$ is a $p \times n$ random matrix with either independent columns sampled from the unit sphere in $\mathbb{R}^p$ or i.i.d. centered entries of variance $n^{-1},$ and $D$ is a diagonal random matrix with i.i.d. entries that is independent of $X.$ This model has important applications in statistics and machine learning. Assuming that $p$ and $n$ are comparably large, we prove that the edge eigenvalues of $Q$ can asymptotically follow several types of distributions, depending on $\Sigma$ and $D$: Gumbel, Fr\'echet, Weibull, Tracy-Widom, Gaussian, and their mixtures. On the one hand, when the random variables in $D$ have unbounded support, the edge eigenvalues of $Q$ follow either a Gumbel or a Fr\'echet distribution, depending on the tail decay of $D.$ On the other hand, when the random variables in $D$ have bounded support, under some mild regularity assumptions on $\Sigma,$ the edge eigenvalues of $Q$ can exhibit Weibull, Tracy-Widom, or Gaussian distributions, or mixtures of them. Based on these theoretical results, we consider two applications. First, we propose statistics and a procedure to detect and estimate possible spikes in elliptically distributed data. Second, in the context of a factor model, we use a multiplier bootstrap procedure, implemented by selecting the weights in $D,$ to obtain a new algorithm for inferring and estimating the number of factors. Numerical simulations confirm the accuracy and power of the proposed methods and show better performance than some existing methods in the literature.
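As a rough illustration of the model $Y=\Sigma^{1/2}XD$ described above, the following Python sketch simulates $Q=YY^*$ and returns its largest eigenvalue. It is not the paper's procedure: the function names `simulate_Q` and `largest_eigenvalue`, the restriction to a diagonal $\Sigma$, the Gaussian choice for the entries of $X$, and the two example laws for $D$ (uniform for bounded support, shifted Pareto for unbounded support) are all assumptions made only for illustration.

```python
import numpy as np


def simulate_Q(p, n, sigma_diag, d_sampler, rng, spherical=False):
    """Draw Q = Y Y^T with Y = Sigma^{1/2} X D (illustrative sketch).

    sigma_diag : length-p array, diagonal of Sigma (assumed diagonal here).
    d_sampler  : hypothetical callable (n, rng) -> length-n array of i.i.d.
                 diagonal entries of D, drawn independently of X.
    spherical  : if True, columns of X are uniform on the unit sphere;
                 otherwise X has i.i.d. centered entries with variance 1/n.
    """
    X = rng.standard_normal((p, n))
    if spherical:
        # Normalizing Gaussian columns gives uniform points on the sphere.
        X /= np.linalg.norm(X, axis=0, keepdims=True)
    else:
        X /= np.sqrt(n)
    d = d_sampler(n, rng)
    # Row scaling applies Sigma^{1/2}; column scaling applies D.
    Y = np.sqrt(sigma_diag)[:, None] * X * d[None, :]
    return Y @ Y.T


def largest_eigenvalue(Q):
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return np.linalg.eigvalsh(Q)[-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, n = 200, 400
    sigma_diag = np.ones(p)

    # Assumed example laws for D: bounded vs. unbounded (heavy-tailed) support.
    bounded = lambda m, r: r.uniform(0.5, 1.5, size=m)
    heavy = lambda m, r: r.pareto(3.0, size=m) + 1.0

    for name, sampler in [("bounded D", bounded), ("heavy-tailed D", heavy)]:
        tops = [largest_eigenvalue(simulate_Q(p, n, sigma_diag, sampler, rng))
                for _ in range(50)]
        print(name, "mean largest eigenvalue:", np.mean(tops))
```

Repeating the draw many times and plotting a histogram of the largest eigenvalue gives an informal way to see how the edge behavior changes with the support and tail of $D$; identifying the limiting law still requires the centering and scaling established in the paper.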