Principal component analysis (PCA) is a simple and popular tool for processing high-dimensional data. We investigate its effectiveness for matrix denoising. We consider the clean data are generated from a low-dimensional subspace, but masked by independent high-dimensional sub-Gaussian noises with standard deviation $\sigma$. Under the low-rank assumption on the clean data with a mild spectral gap assumption, we prove that the distance between each pair of PCA-denoised data point and the clean data point is uniformly bounded by $O(\sigma \log n)$. To illustrate the spectral gap assumption, we show it can be satisfied when the clean data are independently generated with a non-degenerate covariance matrix. We then provide a general lower bound for the error of the denoised data matrix, which indicates PCA denoising gives a uniform error bound that is rate-optimal. Furthermore, we examine how the error bound impacts downstream applications such as clustering and manifold learning. Numerical results validate our theoretical findings and reveal the importance of the uniform error.
翻译:暂无翻译