Discovering latent topology and geometry in data: a law of large dimension

Complex topological and geometric patterns often appear embedded in high-dimensional data and seem to reflect structure related to the underlying data source, with some distortion. We show that this rich data morphology can be explained by a generic and remarkably simple statistical model, demonstrating that manifold structure in data can emerge from elementary statistical ideas of correlation and latent variables. The Latent Metric Space model consists of a collection of random fields, evaluated at locations specified by latent variables and observed in noise. Driven by high dimensionality, principal component scores associated with data from this model are uniformly concentrated around a topological manifold, homeomorphic to the latent metric space. Under further assumptions this relation may be a diffeomorphism, a Riemannian metric structure appears, and the geometry of the manifold reflects that of the latent metric space. This provides statistical justification for manifold assumptions which underlie methods ranging from clustering and topological data analysis, to nonlinear dimension reduction, regression and classification, and explains the efficacy of Principal Component Analysis as a preprocessing tool for reduction from high to moderate dimension.
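The concentration claim can be explored numerically with a small simulation. The sketch below is an illustration under stated assumptions, not the authors' construction: the latent metric space is taken to be a circle, each random field is a random combination of a few Fourier features, the noise is Gaussian, and the principal component scores come from an uncentered SVD scaled by p^{-1/2}. The names `simulate_lms`, `pc_scores` and `relative_gram_error`, and all parameter values, are hypothetical.

```python
import numpy as np


def simulate_lms(n=200, p=5000, weights=(1.0, 0.5), sigma=1.0, seed=0):
    """Toy data: p random fields evaluated at n latent points on a circle.

    Assumed setup (not from the source): field j is
    X_j(theta) = sum_k sqrt(w_k) * (a_jk cos(k theta) + b_jk sin(k theta))
    with i.i.d. standard normal a_jk, b_jk, so that
    E[X_j(theta) X_j(theta')] = sum_k w_k cos(k (theta - theta')).
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)   # latent variables z_i
    k = np.arange(1, len(weights) + 1)              # Fourier frequencies
    amp = np.sqrt(np.asarray(weights, dtype=float))
    angles = np.outer(theta, k)                     # shape (n, K)
    # Feature map phi(theta_i), one row per latent point, shape (n, 2K).
    phi = np.hstack([amp * np.cos(angles), amp * np.sin(angles)])
    coeffs = rng.standard_normal((phi.shape[1], p)) # one column per field
    Y = phi @ coeffs + sigma * rng.standard_normal((n, p))  # observed in noise
    kernel = phi @ phi.T                            # covariance kernel at the z_i
    return theta, Y, kernel


def pc_scores(Y, r):
    """First r uncentered principal component scores, scaled by p**-0.5."""
    p = Y.shape[1]
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, :r] * s[:r] / np.sqrt(p)


def relative_gram_error(scores, kernel):
    """Frobenius distance between score inner products and the kernel, relative."""
    return np.linalg.norm(scores @ scores.T - kernel) / np.linalg.norm(kernel)


if __name__ == "__main__":
    theta, Y, kernel = simulate_lms()
    scores = pc_scores(Y, r=4)   # r = 2K matches the rank of the toy kernel
    print("relative Gram error:", relative_gram_error(scores, kernel))
```

If the inner products of the scores closely match the kernel, the score cloud is approximately an orthogonal transformation of the feature-map image of the circle, which is a closed loop in four dimensions. Increasing `p` in this toy setting should shrink the error, which is the sense in which high dimensionality drives the concentration.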
