This paper is concerned with estimating the column subspace of a low-rank matrix $\boldsymbol{X}^\star \in \mathbb{R}^{n_1\times n_2}$ from contaminated data. Obtaining optimal statistical accuracy while accommodating the widest possible range of signal-to-noise ratios (SNRs) is particularly challenging in the presence of heteroskedastic noise and unbalanced dimensionality (i.e., $n_2\gg n_1$). While the state-of-the-art algorithm $\textsf{HeteroPCA}$ is a powerful solution to this problem, it suffers from "the curse of ill-conditioning": its performance degrades as the condition number of $\boldsymbol{X}^\star$ grows. To overcome this critical issue without compromising the range of allowable SNRs, we propose a novel algorithm, called $\textsf{Deflated-HeteroPCA}$, that achieves near-optimal and condition-number-free theoretical guarantees in terms of both $\ell_2$ and $\ell_{2,\infty}$ statistical accuracy. The proposed algorithm divides the spectrum of $\boldsymbol{X}^\star$ into well-conditioned and mutually well-separated subblocks, and applies $\textsf{HeteroPCA}$ to conquer each subblock successively. Further, applying our algorithm and theory to two canonical examples -- the factor model and tensor PCA -- yields substantial improvements in both settings.
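The divide-and-conquer idea described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the paper's exact procedure: the function names (`top_r_eig`, `hetero_pca`, `choose_rank`, `deflated_hetero_pca`), the iteration count `n_iter`, and the rank-selection thresholds (a condition-number cap `cond_max=4` and a relative gap of $1/r$) are assumptions chosen for illustration. The inner routine is the diagonal-imputation iteration commonly associated with $\textsf{HeteroPCA}$: start from the Gram matrix with its diagonal zeroed out, then repeatedly replace the diagonal with that of the current rank-$r$ approximation.

```python
import numpy as np


def top_r_eig(G, r):
    """Top-r eigenpairs of a symmetric matrix, ordered by decreasing |eigenvalue|."""
    vals, vecs = np.linalg.eigh(G)
    idx = np.argsort(-np.abs(vals))[:r]
    return vals[idx], vecs[:, idx]


def hetero_pca(G, r, n_iter=50):
    """Diagonal imputation: refit diag(G) from the rank-r approximation of G."""
    G = G.copy()
    for _ in range(n_iter):
        vals, vecs = top_r_eig(G, r)
        low_rank = (vecs * vals) @ vecs.T
        np.fill_diagonal(G, np.diag(low_rank))
    _, U = top_r_eig(G, r)
    return U, G


def choose_rank(G, r_prev, r, cond_max=4.0):
    """Pick the largest rank in (r_prev, r] whose new block looks well-conditioned
    and well-separated from the rest (thresholds are illustrative assumptions)."""
    s = np.linalg.svd(G, compute_uv=False)  # singular values, descending
    for cand in range(r, r_prev, -1):
        next_s = s[cand] if cand < len(s) else 0.0
        well_conditioned = s[r_prev] <= cond_max * s[cand - 1]
        well_separated = s[cand - 1] - next_s >= s[cand - 1] / r
        if well_conditioned and well_separated:
            return cand
    return r  # fall back to the full target rank


def deflated_hetero_pca(Y, r, n_iter=50):
    """Successively grow the working rank, warm-starting each stage from the last."""
    G = Y @ Y.T
    np.fill_diagonal(G, 0.0)  # delete the noise-inflated diagonal
    r_prev, U = 0, None
    while r_prev < r:
        r_k = choose_rank(G, r_prev, r)
        U, G = hetero_pca(G, r_k, n_iter)
        r_prev = r_k
    return U


# Synthetic example: ill-conditioned rank-3 signal, row-wise heteroskedastic noise.
rng = np.random.default_rng(0)
n1, n2, r = 50, 2000, 3
U_star, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V_star, _ = np.linalg.qr(rng.standard_normal((n2, r)))
X_star = (U_star * np.array([100.0, 20.0, 5.0])) @ V_star.T
row_std = rng.uniform(0.005, 0.02, size=(n1, 1))  # noise level varies by row
Y = X_star + row_std * rng.standard_normal((n1, n2))

U_hat = deflated_hetero_pca(Y, r)
# Spectral-norm distance between the estimated and true projection matrices.
err = np.linalg.norm(U_hat @ U_hat.T - U_star @ U_star.T, 2)
```

Each stage reuses the diagonal imputed by the previous stage, so later (weaker) spectral blocks are estimated after the dominant ones have already been accounted for; this is the sense in which the subblocks are conquered successively.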