Dimensionality reduction via PCA and factor analysis is an important tool in data analysis. A critical step is selecting the number of components. However, existing methods (such as the scree plot, likelihood ratio, and parallel analysis) lack statistical guarantees in the increasingly common setting where the data are heterogeneous, that is, where each noise entry can have a different distribution. To address this problem, we propose the Signflip Parallel Analysis (Signflip PA) method: it compares the singular values of the data to those of "empirical null" data generated by independently flipping the sign of each entry with probability one-half. We show that Signflip PA consistently selects factors above the noise level in high-dimensional signal-plus-noise models (including spiked models and factor models) under heterogeneous settings, where classical parallel analysis is no longer effective. To do this, we rely on recent results in random matrix theory, such as dimension-free operator norm bounds [Latala et al., 2018, Inventiones Mathematicae] and large deviations for the top eigenvalues of nonhomogeneous matrices [Husson, 2020]. We also illustrate that Signflip PA performs well in numerical simulations and on empirical data examples.
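The sign-flipping comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the abstract does not state the exact selection rule, so the sketch assumes that a factor is kept when its singular value exceeds a quantile of the top singular value of the sign-flipped data, and that selection stops at the first factor that fails. The function name `signflip_pa`, the parameters `n_trials` and `quantile`, and the simulated data are all hypothetical choices made for the example.

```python
import numpy as np


def signflip_pa(X, n_trials=20, quantile=1.0, seed=0):
    """Sketch of a sign-flip parallel analysis (assumed selection rule).

    Returns the number of leading singular values of X that exceed the
    chosen quantile of the top singular value of sign-flipped copies of X,
    together with that threshold.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Singular values of the observed data, in decreasing order.
    data_sv = np.linalg.svd(X, compute_uv=False)

    # "Empirical null": flip the sign of each entry independently with
    # probability one-half and record the top singular value.
    null_top = np.empty(n_trials)
    for t in range(n_trials):
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_top[t] = np.linalg.svd(signs * X, compute_uv=False)[0]

    threshold = np.quantile(null_top, quantile)

    # Assumption: keep leading factors until one falls below the threshold.
    k = 0
    for s in data_sv:
        if s > threshold:
            k += 1
        else:
            break
    return k, threshold


# Hypothetical demo: rank-2 signal plus heterogeneous noise, where every
# noise entry has its own standard deviation.
rng = np.random.default_rng(1)
n, p = 200, 100
U = np.linalg.qr(rng.standard_normal((n, 2)))[0]  # orthonormal left factors
V = np.linalg.qr(rng.standard_normal((p, 2)))[0]  # orthonormal right factors
signal = U @ np.diag([200.0, 150.0]) @ V.T
noise_sd = rng.uniform(0.5, 1.5, size=(n, p))     # entrywise noise levels
X = signal + noise_sd * rng.standard_normal((n, p))

k, threshold = signflip_pa(X)
print(k, threshold)
```

Sign-flipping scrambles the low-rank signal while leaving the magnitude of each entry unchanged, so the entrywise noise levels are preserved in the null data. In this demo the two planted singular values are far larger than the null threshold, so both are selected.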