Studying the stability and sensitivity of statistical methods and algorithms with respect to their input data is an important problem in machine learning and statistics. An algorithm's performance under resampling of the data is a fundamental measure of its stability and is closely related to the generalization and privacy properties of the algorithm. In this paper, we study the resampling sensitivity of principal component analysis (PCA). Given an $ n \times p $ random matrix $ \mathbf{X} $, let $ \mathbf{X}^{[k]} $ be the matrix obtained from $ \mathbf{X} $ by resampling $ k $ randomly chosen entries of $ \mathbf{X} $. Let $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ denote the principal components of $ \mathbf{X} $ and $ \mathbf{X}^{[k]} $, respectively. In the proportional growth regime $ p/n \to \xi \in (0,1] $, we establish the sharp threshold for the sensitivity/stability transition of PCA. When $ k \gg n^{5/3} $, the principal components $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ are asymptotically orthogonal. On the other hand, when $ k \ll n^{5/3} $, the principal components $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ are asymptotically collinear. In other words, since $ \mathbf{X} $ has $ np \asymp n^2 $ entries, PCA is sensitive to the input data in the sense that resampling even a negligible portion of the input may completely change the output.
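The resampling procedure described above can be illustrated numerically. The following is a minimal sketch, not the method of the paper: it assumes i.i.d. standard Gaussian entries, takes the principal component to be the top right singular vector of $ \mathbf{X} $ (no centering, since the entries have mean zero), and measures the overlap $ |\langle \mathbf{v}, \mathbf{v}^{[k]} \rangle| $. The function names `top_principal_component`, `resample_entries`, and `resampling_overlap` are hypothetical and introduced only for this example. A simulation at finite $ n $ cannot confirm the asymptotic threshold $ n^{5/3} $; it only shows the two extremes of the transition.

```python
import numpy as np


def top_principal_component(X):
    """Return the top right singular vector of X (a unit vector in R^p).

    Assumption: the principal component is the leading eigenvector of X^T X.
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]


def resample_entries(X, k, rng):
    """Return a copy of X in which k distinct, uniformly chosen entries
    are replaced by fresh independent standard Gaussian draws."""
    n, p = X.shape
    X_new = X.copy()
    flat_idx = rng.choice(n * p, size=k, replace=False)
    rows, cols = np.unravel_index(flat_idx, (n, p))
    X_new[rows, cols] = rng.standard_normal(k)
    return X_new


def resampling_overlap(n, p, k, seed=0):
    """Return |<v, v^[k]>| for one random n x p Gaussian matrix.

    The absolute value removes the sign ambiguity of singular vectors.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    X_k = resample_entries(X, k, rng)
    v = top_principal_component(X)
    v_k = top_principal_component(X_k)
    return abs(float(v @ v_k))


if __name__ == "__main__":
    n, p = 200, 100
    # Sweep k from no resampling up to resampling every entry.
    for k in (0, 100, 1000, 5000, n * p):
        print(f"k = {k:6d}  overlap = {resampling_overlap(n, p, k):.3f}")
```

An overlap near 1 corresponds to the stable regime (asymptotically collinear components) and an overlap near 0 to the sensitive regime (asymptotically orthogonal components). Averaging over several seeds and increasing $ n $ would give a smoother picture of where the transition occurs.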