In statistical dimensionality reduction, it is common to rely on the assumption that high-dimensional data tend to concentrate near a lower-dimensional manifold. There is a rich literature on approximating the unknown manifold, and on exploiting such approximations in clustering, data compression, and prediction. Most of the literature relies on linear or locally linear approximations. In this article, we propose a simple and general alternative, which instead uses spheres, an approach we refer to as spherelets. We develop spherical principal components analysis (SPCA), and provide theory on the convergence rate for global and local SPCA, while showing that spherelets can provide lower covering numbers and mean squared errors (MSEs) for many manifolds. Comparisons with state-of-the-art competitors show gains in the ability to approximate manifolds accurately with fewer components. Unlike most competitors, which simply output lower-dimensional features, our approach projects data onto the estimated manifold to produce fitted values that can be used for model assessment and cross-validation. The methods are illustrated with applications to multiple data sets.
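To make the idea of fitting a sphere and projecting data onto it concrete, the following is a minimal sketch rather than the SPCA procedure itself. It assumes a plain algebraic least-squares sphere fit in the ambient space, and the names `fit_sphere` and `project_to_sphere`, as well as the simulated noisy circle, are hypothetical and chosen only for illustration. The fitted values are the radial projections of the data onto the estimated sphere, and their mean squared distance from the data gives an MSE of the kind the abstract mentions.

```python
import numpy as np


def fit_sphere(X):
    """Algebraic least-squares sphere fit (illustrative assumption, not SPCA).

    Uses the identity ||x||^2 = 2 c.x + (r^2 - ||c||^2), which is linear
    in the unknowns c (centre) and b = r^2 - ||c||^2.
    """
    X = np.asarray(X, dtype=float)
    A = np.hstack([2.0 * X, np.ones((X.shape[0], 1))])
    y = np.sum(X ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, y, rcond=None)
    center = sol[:-1]
    radius = np.sqrt(sol[-1] + center @ center)
    return center, radius


def project_to_sphere(X, center, radius):
    """Move each point radially onto the sphere to obtain fitted values."""
    X = np.asarray(X, dtype=float)
    diff = X - center
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    # A point exactly at the centre has no unique projection;
    # the guard below leaves such a point at the centre.
    norms = np.where(norms == 0, 1.0, norms)
    return center + radius * diff / norms


# Simulated data (hypothetical): a noisy circle in the plane.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
true_center = np.array([1.0, -2.0])
true_radius = 3.0
clean = true_center + true_radius * np.column_stack([np.cos(theta), np.sin(theta)])
noisy = clean + rng.normal(scale=0.05, size=clean.shape)

center, radius = fit_sphere(noisy)
fitted = project_to_sphere(noisy, center, radius)

# Mean squared distance between the data and their fitted values.
mse = np.mean(np.sum((noisy - fitted) ** 2, axis=1))
print(center, radius, mse)
```

Because the fitted values live in the original data space, the same MSE can be computed on held-out points, which is what makes cross-validation possible in this setting.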