In applied multivariate statistics, estimating the number of latent dimensions or the number of clusters is a fundamental and recurring problem. One common diagnostic is the scree plot, which shows the largest eigenvalues of the data matrix; the user searches for a "gap" or "elbow" in the decreasing eigenvalues. Unfortunately, these patterns can be hidden by the bias of the sample eigenvalues. This methodological problem is also conceptually difficult because, in many situations, there is only enough signal to detect a subset of the $k$ population dimensions/eigenvectors. In such situations, one could argue that the correct choice of $k$ is the number of detectable dimensions. We alleviate these problems with cross-validated eigenvalues. Under a large class of random graph models, without any parametric assumptions, we provide a p-value for each sample eigenvector. Each p-value tests the null hypothesis that the corresponding sample eigenvector is orthogonal to (i.e., uncorrelated with) the true latent dimensions. This approach naturally adapts to problems where some dimensions are not statistically detectable. In scenarios where all $k$ dimensions can be estimated, we prove that our procedure consistently estimates $k$. In simulations and a data example, the proposed estimator compares favorably to alternative approaches in both computational and statistical performance.