Distributed Principal Subspace Analysis for Partitioned Big Data: Algorithms, Analysis, and Implementation

Principal Subspace Analysis (PSA), along with its sibling Principal Component Analysis (PCA), is one of the most popular approaches for dimensionality reduction in signal processing and machine learning. But centralized PSA/PCA solutions are fast becoming irrelevant in the modern era of big data, in which the number of samples and/or the dimensionality of samples often exceed the storage and/or computational capabilities of individual machines. This has led to the study of distributed PSA/PCA solutions, in which the data are partitioned across multiple machines and an estimate of the principal subspace is obtained through collaboration among the machines. It is in this vein that this paper revisits the problem of distributed PSA/PCA under the general framework of an arbitrarily connected network of machines that lacks a central server. The main contributions of the paper in this regard are threefold. First, two algorithms are proposed for distributed PSA/PCA, one for data partitioned across samples and the other for data partitioned across (raw) features. Second, in the case of sample-wise partitioned data, the proposed algorithm and a variant of it are analyzed, and their convergence to the true subspace at linear rates is established. Third, extensive experiments on both synthetic and real-world data are carried out to validate the usefulness of the proposed algorithms. In particular, in the case of sample-wise partitioned data, an MPI-based distributed implementation is carried out to study the interplay between network topology and communication cost as well as to study the effects of straggler machines on the proposed algorithms.
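
To make the sample-wise setting concrete, the following is a minimal single-process sketch of one generic way to estimate a principal subspace without a central server: each node multiplies its local covariance by its current estimate, the nodes mix the results with their neighbors over several consensus-averaging rounds, and each node then orthonormalizes locally. This is an illustration under stated assumptions, not necessarily the algorithm proposed in the paper. The function names, the ring topology, the mixing weights, and the iteration counts are all hypothetical choices, and the network is simulated with a mixing matrix rather than with MPI.

```python
import numpy as np

def ring_weights(n):
    """Doubly stochastic mixing matrix for a ring of n nodes (assumed topology)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] += 1.0 / 3.0
        W[i, (i - 1) % n] += 1.0 / 3.0
        W[i, (i + 1) % n] += 1.0 / 3.0
    return W

def qr_pos(Z):
    """QR factor Q with the sign convention diag(R) > 0, so that nearly
    identical inputs at different nodes give nearly identical bases."""
    Q, R = np.linalg.qr(Z)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return Q * s

def distributed_orthogonal_iteration(local_covs, W, k,
                                     outer_iters=50, consensus_rounds=30, seed=0):
    """Consensus-based orthogonal iteration (illustrative sketch).

    local_covs: list of d x d matrices, node i holding X_i @ X_i.T
    W:          n x n doubly stochastic mixing matrix of the network
    k:          dimension of the principal subspace to estimate
    Returns one d x k orthonormal basis per node.
    """
    n = len(local_covs)
    d = local_covs[0].shape[0]
    rng = np.random.default_rng(seed)
    Q0 = qr_pos(rng.standard_normal((d, k)))
    Q = [Q0.copy() for _ in range(n)]  # assumption: common initialization
    for _ in range(outer_iters):
        # Local step: each node uses only its own data.
        Z = np.stack([M @ Qi for M, Qi in zip(local_covs, Q)])  # shape (n, d, k)
        # Consensus averaging: each round mixes values with ring neighbors.
        for _ in range(consensus_rounds):
            Z = np.einsum("ij,jdk->idk", W, Z)
        # Local orthonormalization at every node.
        Q = [qr_pos(Zi) for Zi in Z]
    return Q

def subspace_error(U, Q):
    """Frobenius distance between the projectors onto span(U) and span(Q)."""
    return np.linalg.norm(U @ U.T - Q @ Q.T)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, k, n = 8, 2, 4
    scales = np.array([6.0, 5.0] + [1.0] * (d - 2))  # synthetic eigengap after k
    covs = []
    for _ in range(n):
        X = scales[:, None] * rng.standard_normal((d, 300))  # local samples
        covs.append(X @ X.T)
    Q = distributed_orthogonal_iteration(covs, ring_weights(n), k)
    U = np.linalg.eigh(sum(covs))[1][:, -k:]  # centralized reference subspace
    print([subspace_error(U, Qi) for Qi in Q])
```

The number of consensus rounds per outer iteration is the knob that ties topology to communication cost in this sketch: a sparsely connected network mixes more slowly and needs more rounds to reach the same accuracy.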
