This article introduces an elementary method for reducing the computational and storage costs of spectral clustering and principal component analysis. The method consists in randomly "puncturing" both the data matrix $X\in\mathbb{C}^{p\times n}$ (or $\mathbb{R}^{p\times n}$) and its corresponding kernel (Gram) matrix $K$ through Bernoulli masks: $S\in\{0,1\}^{p\times n}$ for $X$ and $B\in\{0,1\}^{n\times n}$ for $K$. The resulting "two-way punctured" kernel is thus given by $K=\frac{1}{p}[(X \odot S)^{\sf H} (X \odot S)] \odot B$, where $\odot$ denotes the entrywise (Hadamard) product. We demonstrate that, for $X$ composed of independent columns drawn from a Gaussian mixture model, as $n,p\to\infty$ with $p/n\to c_0\in(0,\infty)$, the spectral behavior of $K$ (its limiting eigenvalue distribution, as well as its isolated eigenvalues and eigenvectors) is fully tractable and exhibits a series of counter-intuitive phenomena. We notably prove, and empirically confirm on GAN-generated image databases, that it is possible to drastically puncture the data, thereby providing possibly huge computational and storage gains, for a virtually constant (clustering or PCA) performance. This preliminary study thus opens the path towards rethinking, from a large dimensional standpoint, computational and storage costs in elementary machine learning models.
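As a rough illustration of the construction described above, the following NumPy sketch builds a two-way punctured kernel from a small synthetic two-class Gaussian mixture. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `two_way_punctured_kernel`, the keep-probabilities `eps_S` and `eps_B`, the choice to make $B$ symmetric by mirroring its upper triangle, and the toy data are all hypothetical choices made for illustration only.

```python
import numpy as np


def two_way_punctured_kernel(X, eps_S, eps_B, rng=None):
    """Return (K, S, B) with K = (1/p) [(X*S)^H (X*S)] * B (entrywise products).

    Assumptions (not stated in the abstract): mask entries are kept with
    probabilities eps_S and eps_B, and B is made symmetric by mirroring
    its upper triangle so that K stays Hermitian.
    """
    rng = np.random.default_rng(rng)
    p, n = X.shape
    # Bernoulli mask on the data matrix: each entry is kept with prob. eps_S.
    S = rng.random((p, n)) < eps_S
    # Bernoulli mask on the kernel: draw the upper triangle, then mirror it.
    upper = np.triu(rng.random((n, n)) < eps_B)
    B = upper | upper.T
    XS = X * S  # punctured data, X ⊙ S
    K = (XS.conj().T @ XS) / p * B  # punctured Gram matrix
    return K, S, B


# Toy usage: two Gaussian classes with means +mu and -mu (hypothetical data).
rng = np.random.default_rng(0)
p, n = 200, 400
mu = np.ones(p) * 2.0 / np.sqrt(p)
labels = np.repeat([1.0, -1.0], n // 2)
X = np.outer(mu, labels) + rng.standard_normal((p, n))

K, S, B = two_way_punctured_kernel(X, eps_S=0.5, eps_B=0.3, rng=1)

# Spectral clustering would use the dominant eigenvector(s) of K.
eigvals, eigvecs = np.linalg.eigh(K)
top_vec = eigvecs[:, -1]
print("fraction of kernel entries kept:", B.mean())
print("largest eigenvalue of K:", eigvals[-1])
```

In this sketch the storage gain comes from the zeros in `X * S` and in `K`; an actual implementation would presumably store both in a sparse format and use a sparse eigensolver rather than the dense `eigh` call shown here.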