Analytical understanding of how low-dimensional latent features reveal themselves in high-dimensional data is still lacking. We study this by defining a linear latent feature model with additive noise, constructed from probabilistic matrices, and by computing, both analytically and numerically, the statistical distributions of the pairwise correlations and of the eigenvalues of the correlation matrix. This allows us to resolve the latent feature structure across a wide range of data regimes set by the number of recorded variables, the number of observations, the number of latent features, and the signal-to-noise ratio. We find a characteristic imprint of latent features in the distributions of correlations and eigenvalues, and we provide an analytic estimate for the boundary between signal and noise even in the absence of a clear spectral gap.
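As a rough numerical illustration of the setup described in the abstract, the following sketch simulates a linear latent feature model with additive noise and computes the pairwise correlations and the eigenvalues of the correlation matrix. Several choices here are assumptions and are not taken from the paper: all matrices have i.i.d. standard Gaussian entries, the signal-to-noise ratio `snr` is defined as a ratio of per-entry variances, and the function names are hypothetical. The Marchenko-Pastur upper edge is used only as a standard pure-noise reference point; it is not the analytic signal/noise boundary derived in the paper.

```python
import numpy as np


def simulate_latent_feature_data(n_obs, n_vars, n_latent, snr, rng):
    """Draw data = sqrt(snr) * (U @ V) / sqrt(n_latent) + noise.

    Assumption: U, V and the noise all have i.i.d. standard Gaussian
    entries; the paper's actual construction may differ.
    """
    latent_scores = rng.standard_normal((n_obs, n_latent))     # U
    latent_loadings = rng.standard_normal((n_latent, n_vars))  # V
    # Dividing by sqrt(n_latent) keeps the per-entry signal variance near 1.
    signal = latent_scores @ latent_loadings / np.sqrt(n_latent)
    noise = rng.standard_normal((n_obs, n_vars))
    return np.sqrt(snr) * signal + noise


def correlation_spectrum(data):
    """Return the off-diagonal correlations and the sorted eigenvalues."""
    corr = np.corrcoef(data, rowvar=False)  # variables are columns
    pairwise = corr[np.triu_indices_from(corr, k=1)]
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # descending order
    return pairwise, eigvals


def marchenko_pastur_upper_edge(n_obs, n_vars):
    """Upper edge of the Marchenko-Pastur bulk for pure unit-variance noise."""
    return (1.0 + np.sqrt(n_vars / n_obs)) ** 2


rng = np.random.default_rng(0)
n_obs, n_vars, n_latent, snr = 2000, 200, 5, 1.0

data = simulate_latent_feature_data(n_obs, n_vars, n_latent, snr, rng)
pairwise, eigvals = correlation_spectrum(data)
edge = marchenko_pastur_upper_edge(n_obs, n_vars)

# Count eigenvalues above the pure-noise reference edge.
n_above = int(np.sum(eigvals > edge))
print(f"std of pairwise correlations: {pairwise.std():.3f}")
print(f"eigenvalues above the pure-noise edge ({edge:.2f}): {n_above}")
```

Lowering `snr` or raising `n_latent` in this sketch moves the leading eigenvalues toward the noise bulk, which is the regime in which a clear spectral gap is no longer visible.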