This paper studies how to construct confidence regions for principal component analysis (PCA) in high dimension, a problem that has been vastly under-explored. While computing measures of uncertainty for nonlinear/nonconvex estimators is in general difficult in high dimension, the challenge is further compounded by the prevalent presence of missing data and heteroskedastic noise. We propose a suite of solutions to perform valid inference on the principal subspace based on two estimators: a vanilla SVD-based approach, and a more refined iterative scheme called $\textsf{HeteroPCA}$ (Zhang et al., 2018). We develop non-asymptotic distributional guarantees for both estimators, and demonstrate how these can be invoked to compute both confidence regions for the principal subspace and entrywise confidence intervals for the spiked covariance matrix. Particularly worth highlighting is the inference procedure built on top of $\textsf{HeteroPCA}$, which is not only valid but also statistically efficient for broader scenarios (e.g., it covers a wider range of missing rates and signal-to-noise ratios). Our solutions are fully data-driven and adaptive to heteroskedastic random noise, without requiring prior knowledge about the noise levels and noise distributions.
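As a rough illustration of the two estimators named above, the following Python sketch contrasts a vanilla SVD-based subspace estimate with a diagonal-imputation iteration in the spirit of $\textsf{HeteroPCA}$ (Zhang et al., 2018). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `vanilla_svd_subspace` and `heteropca_subspace` are hypothetical, the rank `r` is assumed known, the iteration count is fixed rather than chosen by a stopping rule, and the data matrix is assumed fully observed (missing entries are not handled here). The confidence-region construction itself is not shown.

```python
import numpy as np


def vanilla_svd_subspace(Y, r):
    """Top-r left singular vectors of the d x n data matrix Y."""
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, :r]


def _top_r_eig(G, r):
    """Eigenpairs of symmetric G with the r largest eigenvalues in magnitude."""
    w, V = np.linalg.eigh(G)
    idx = np.argsort(np.abs(w))[::-1][:r]
    return w[idx], V[:, idx]


def heteropca_subspace(Y, r, n_iter=50):
    """Diagonal-imputation iteration (assumed form, see lead-in).

    The diagonal of the Gram matrix Y Y^T is the part most affected by
    heteroskedastic noise, so it is discarded and then re-imputed from
    the current rank-r approximation at each iteration.
    """
    gram = Y @ Y.T
    off_diag = gram - np.diag(np.diag(gram))  # zero out the diagonal
    G = off_diag.copy()
    for _ in range(n_iter):
        w, V = _top_r_eig(G, r)
        low_rank = (V * w) @ V.T  # best rank-r approximation of G
        # Keep the observed off-diagonal entries, impute only the diagonal.
        G = off_diag + np.diag(np.diag(low_rank))
    _, V = _top_r_eig(G, r)
    return V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, r = 30, 500, 2
    U_true, _ = np.linalg.qr(rng.standard_normal((d, r)))
    signal = U_true @ (5.0 * rng.standard_normal((r, n)))
    # Row-dependent noise levels make the noise heteroskedastic.
    noise_sd = np.linspace(0.1, 2.0, d)[:, None]
    Y = signal + noise_sd * rng.standard_normal((d, n))

    for name, est in [("vanilla SVD", vanilla_svd_subspace),
                      ("HeteroPCA-style", heteropca_subspace)]:
        U_hat = est(Y, r)
        # Spectral-norm distance between the two projection matrices.
        err = np.linalg.norm(U_hat @ U_hat.T - U_true @ U_true.T, 2)
        print(f"{name}: subspace error = {err:.4f}")
```

Both functions return a d x r matrix with orthonormal columns, so the two estimates can be compared through their projection matrices, as the demo at the bottom does on synthetic data.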