Principal Component Analysis (PCA) is the workhorse tool for dimensionality reduction in this era of big data. While often overlooked, the purpose of PCA is not only to reduce data dimensionality, but also to yield features that are uncorrelated. Furthermore, the ever-increasing volume of data in the modern world often requires storage of data samples across multiple machines, which precludes the use of centralized PCA algorithms. This paper focuses on the dual objective of PCA, namely, dimensionality reduction and decorrelation of features, but in a distributed setting. This requires estimating the eigenvectors of the data covariance matrix, as opposed to only estimating the subspace spanned by the eigenvectors, when data is distributed across a network of machines. Although a few distributed solutions to the PCA problem have been proposed recently, convergence guarantees and/or communications overhead of these solutions remain a concern. With an eye towards communications efficiency, this paper introduces a feedforward neural network-based one time-scale distributed PCA algorithm termed Distributed Sanger's Algorithm (DSA) that estimates the eigenvectors of the data covariance matrix when data is distributed across an undirected and arbitrarily connected network of machines. Furthermore, the proposed algorithm is shown to converge linearly to a neighborhood of the true solution. Numerical results are also provided to demonstrate the efficacy of the proposed solution.
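The abstract does not state the update equations, so the following is only a minimal sketch of how a one time-scale, consensus-plus-Sanger iteration might look. It assumes each node i holds a local covariance matrix C_i and, in every iteration, averages its neighbors' estimates once and then takes one local Sanger (generalized Hebbian) step. The function names, the ring topology, the 1/3 mixing weights, the step size, the shared initialization, and the synthetic data are all illustrative assumptions rather than details from the paper. The upper-triangular term is what separates individual eigenvectors from a mere basis of the principal subspace.

```python
import numpy as np

def sanger_direction(C, X):
    """Sanger's (generalized Hebbian) direction: C X - X triu(X^T C X)."""
    return C @ X - X @ np.triu(X.T @ C @ X)

def ring_weights(n_nodes):
    """Hypothetical mixing matrix for a ring (n_nodes >= 3): weight 1/3 on self and each neighbor."""
    W = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in (i - 1, i, i + 1):
            W[i, j % n_nodes] = 1.0 / 3.0
    return W

def distributed_sanger(local_covs, W, k, step=0.1, iters=2000, seed=0):
    """Assumed update: X_i <- sum_j W[i, j] X_j + step * sanger_direction(C_i, X_i)."""
    rng = np.random.default_rng(seed)
    d = local_covs[0].shape[0]
    # Assumption: every node starts from the same random orthonormal matrix.
    X0, _ = np.linalg.qr(rng.standard_normal((d, k)))
    X = [X0.copy() for _ in local_covs]
    n = len(X)
    for _ in range(iters):
        # One round of neighbor averaging per iteration (single time scale).
        mixed = [sum(W[i, j] * X[j] for j in range(n)) for i in range(n)]
        # Local Sanger step evaluated at the node's previous estimate.
        X = [mixed[i] + step * sanger_direction(local_covs[i], X[i]) for i in range(n)]
    return X

def make_local_covariances(n_nodes=6, n_samples=2000, seed=1):
    """Synthetic data: every node samples from one Gaussian with a known spectrum."""
    rng = np.random.default_rng(seed)
    eigvals = np.array([1.0, 0.7, 0.4, 0.2, 0.1])
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
    covs = []
    for _ in range(n_nodes):
        Y = rng.standard_normal((n_samples, 5)) * np.sqrt(eigvals) @ Q.T
        covs.append(Y.T @ Y / n_samples)
    return covs

if __name__ == "__main__":
    covs = make_local_covariances()
    estimates = distributed_sanger(covs, ring_weights(len(covs)), k=2)
    C = sum(covs) / len(covs)
    _, vecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :2]       # two dominant eigenvectors
    for i, Xi in enumerate(estimates):
        Xi_unit = Xi / np.linalg.norm(Xi, axis=0)
        print(i, np.abs(np.sum(Xi_unit * top, axis=0)))
```

The script prints, for each node, the absolute cosine between its estimated columns and the two dominant eigenvectors of the network-wide covariance. With a constant step size, the node estimates should be expected to settle near those eigenvectors rather than exactly on them, in line with the abstract's claim of convergence to a neighborhood of the true solution.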