A key challenge in performing effective analyses of high-dimensional data is finding a signal-rich, low-dimensional representation. For linear subspaces, this is generally done by decomposing a design matrix (via eigenvalue or singular value decomposition) into orthogonal components and then retaining those components with sufficient variation. This is equivalent to estimating the rank of the matrix, and deciding which components to retain is generally carried out using heuristic or ad hoc approaches, such as plotting the decreasing sequence of the eigenvalues and looking for the "elbow" in the plot. While these approaches have been shown to be effective, a poorly calibrated or misjudged elbow location can result in an overabundance of noise or an under-abundance of signal in the low-dimensional representation, making subsequent modeling difficult. In this article, we propose a latent-space-construction procedure to estimate the rank of the detectable signal space of a matrix by retaining components whose variation is significantly greater than that of random matrices, whose eigenvalues follow the universal Marchenko-Pastur (MP) distribution.
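To make the idea concrete, the following Python sketch counts the sample-covariance eigenvalues of a data matrix that exceed the upper edge of the Marchenko-Pastur distribution for a pure-noise matrix. This is a minimal illustration under simplifying assumptions, not the paper's actual procedure: the function name estimate_signal_rank is hypothetical, and the noise variance sigma2 is assumed known rather than estimated.

    import numpy as np

    def estimate_signal_rank(X, sigma2=1.0):
        """Count components of the n x p matrix X whose sample-covariance
        eigenvalues exceed the Marchenko-Pastur upper edge for noise of
        variance sigma2. Illustrative sketch only."""
        n, p = X.shape
        gamma = p / n                                  # aspect ratio
        mp_upper = sigma2 * (1.0 + np.sqrt(gamma))**2  # MP upper edge
        evals = np.linalg.eigvalsh(X.T @ X / n)        # covariance eigenvalues
        # Retain components with variation beyond what noise alone can produce
        return int(np.sum(evals > mp_upper))

    # Example: 20 planted signal directions plus unit-variance noise
    rng = np.random.default_rng(0)
    n, p, k = 1000, 500, 20
    signal = rng.standard_normal((n, k)) @ (5.0 * rng.standard_normal((k, p)))
    X = signal + rng.standard_normal((n, p))
    print(estimate_signal_rank(X))                     # expected to be close to k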