Complex topological and geometric patterns often appear embedded in high-dimensional data and seem to reflect structure related to the underlying data source, with some distortion. We show that this rich data morphology can be explained by a generic and remarkably simple statistical model, demonstrating that manifold structure in data can emerge from elementary statistical ideas of dependence, correlation and latent variables. The Latent Metric Space model consists of a collection of random fields, evaluated at locations specified by latent variables and observed in noise. Driven by high dimensionality, principal component scores associated with data from this model are uniformly concentrated around a topological manifold, homeomorphic to the latent metric space. Under further assumptions this relation may be a diffeomorphism, a Riemannian metric structure appears, and the geometry of the manifold reflects that of the latent metric space. This provides statistical justification for manifold assumptions which underlie methods ranging from clustering and topological data analysis, to nonlinear dimension reduction, regression and classification, and explains the efficacy of Principal Component Analysis as a preprocessing tool for reduction from high to moderate dimension.
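The concentration claim can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' construction: the latent metric space is taken to be a circle, each random field is a random combination of four trigonometric features (so the covariance kernel is f(z, z') = cos(d) + 0.25 cos(2d), with d the angular difference), the noise is Gaussian, and the principal component scores come from uncentered PCA scaled by p^{-1/2}. The function names `simulate_scores` and `gram_error`, and all parameter values, are hypothetical choices made for illustration. If the scores concentrate around an image of the latent space, their inner products should approximate the kernel, and the approximation should improve as the dimension p grows.

```python
import numpy as np


def simulate_scores(n=200, p=4000, sigma=0.5, r=4, seed=0):
    """Simulate a toy latent-position model and return PCA scores.

    Assumptions (illustrative only): latent variables are uniform on a
    circle, and each of the p random fields is a Gaussian combination of
    four trigonometric features of the latent angle.
    """
    rng = np.random.default_rng(seed)

    # Latent variables Z_1, ..., Z_n: angles on the circle.
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)

    # Feature map phi: circle -> R^4 (continuous and injective).
    phi = np.column_stack([
        np.cos(theta),
        np.sin(theta),
        0.5 * np.cos(2.0 * theta),
        0.5 * np.sin(2.0 * theta),
    ])

    # p independent random fields X_j(z) = <phi(z), a_j>, a_j ~ N(0, I_4),
    # so that E[X_j(z) X_j(z')] = <phi(z), phi(z')>.
    coeffs = rng.standard_normal((4, p))
    fields = phi @ coeffs

    # Observation in noise: Y_ij = X_j(Z_i) + sigma * eps_ij.
    data = fields + sigma * rng.standard_normal((n, p))

    # Uncentered PCA via the SVD; scores scaled by p^{-1/2}.
    u, s, _ = np.linalg.svd(data, full_matrices=False)
    scores = u[:, :r] * s[:r] / np.sqrt(p)

    # Kernel matrix f(Z_i, Z_k) = <phi(Z_i), phi(Z_k)>.
    kernel = phi @ phi.T
    return scores, kernel


def gram_error(p, **kwargs):
    """Largest entrywise gap between score inner products and the kernel."""
    scores, kernel = simulate_scores(p=p, **kwargs)
    return float(np.max(np.abs(scores @ scores.T - kernel)))


if __name__ == "__main__":
    for p in (50, 500, 4000):
        print(f"p = {p:5d}  max |<score_i, score_k> - f(Z_i, Z_k)| = {gram_error(p):.3f}")
```

Comparing inner products rather than the scores themselves sidesteps the fact that the scores are only determined up to an orthogonal transformation. In this toy setting, plotting the first two score coordinates should show a noisy ring, consistent with a manifold homeomorphic to the latent circle.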