Principal Subspace Analysis (PSA) is one of the most popular approaches for dimensionality reduction in signal processing and machine learning. But centralized PSA solutions are fast becoming irrelevant in the modern era of big data, in which the number of samples and/or the dimensionality of samples often exceed the storage and/or computational capabilities of individual machines. This has led to the study of distributed PSA solutions, in which the data are partitioned across multiple machines and an estimate of the principal subspace is obtained through collaboration among the machines. It is in this vein that this paper revisits the problem of distributed PSA under the general framework of an arbitrarily connected network of machines that lacks a central server. The main contributions of the paper in this regard are threefold. First, two algorithms are proposed that can be used for distributed PSA when the data are partitioned across either samples or (raw) features. Second, in the case of sample-wise partitioned data, the proposed algorithm and a variant of it are analyzed, and their convergence to the true subspace at linear rates is established. Third, extensive experiments on both synthetic and real-world data are carried out to validate the usefulness of the proposed algorithms. In particular, in the case of sample-wise partitioned data, an MPI-based distributed implementation is carried out to study the interplay between network topology and communication cost, as well as the effect of straggler machines on the proposed algorithms.
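To make the sample-wise partitioned setting concrete, the following is a minimal sketch of one generic way to estimate a principal subspace over a server-less network: orthogonal iteration in which each node replaces the centralized matrix product with a few rounds of neighbor averaging. It is not the algorithm proposed in the paper, which the abstract does not specify. The function names (`ring_weights`, `qr_positive`, `decentralized_orthogonal_iteration`, `subspace_error`), the ring topology, the common initialization at all nodes, and the iteration counts are all assumptions made for illustration, and the network is simulated in a single process rather than with MPI.

```python
import numpy as np

def ring_weights(n_nodes):
    """Doubly stochastic mixing matrix for a ring (assumed topology):
    each node averages its own value with those of its two neighbors."""
    W = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in (i - 1, i, i + 1):
            W[i, j % n_nodes] += 1.0 / 3.0
    return W

def qr_positive(V):
    """QR factor with a positive diagonal in R, so nodes holding nearly
    equal matrices obtain nearly equal orthonormal bases."""
    Q, R = np.linalg.qr(V)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def decentralized_orthogonal_iteration(local_data, r, W,
                                       n_outer=50, n_consensus=50, seed=0):
    """local_data[i] is the d x n_i block of samples held by node i.
    Returns an array of shape (n_nodes, d, r) with one estimate per node."""
    rng = np.random.default_rng(seed)
    d = local_data[0].shape[0]
    n_nodes = len(local_data)
    covs = [X @ X.T for X in local_data]           # local scatter matrices
    Q0 = qr_positive(rng.standard_normal((d, r)))  # common start (assumption)
    Q = np.stack([Q0] * n_nodes)
    for _ in range(n_outer):
        # Local step: each node multiplies its own scatter matrix by its estimate.
        V = np.stack([M @ Qi for M, Qi in zip(covs, Q)])
        # Consensus step: repeated neighbor averaging approximates the network mean.
        for _ in range(n_consensus):
            V = np.einsum("ij,jkl->ikl", W, V)
        # Local orthonormalization.
        Q = np.stack([qr_positive(Vi) for Vi in V])
    return Q

def subspace_error(Q, U):
    """Spectral-norm distance between the projectors onto span(Q) and span(U)."""
    return np.linalg.norm(Q @ Q.T - U @ U.T, 2)

# Synthetic demo: 6 nodes, 10-dimensional samples, 2-dimensional principal subspace.
rng = np.random.default_rng(1)
d, r, n_nodes, n_local = 10, 2, 6, 200
basis, _ = np.linalg.qr(rng.standard_normal((d, d)))
scales = np.array([10.0, 8.0] + [1.0] * (d - r))
local_data = [basis @ (scales[:, None] * rng.standard_normal((d, n_local)))
              for _ in range(n_nodes)]

# Centralized reference: top-r eigenvectors of the pooled scatter matrix.
X_all = np.hstack(local_data)
_, eigvecs = np.linalg.eigh(X_all @ X_all.T)
U = eigvecs[:, -r:]

Q_nodes = decentralized_orthogonal_iteration(local_data, r, ring_weights(n_nodes))
print("max subspace error across nodes:",
      max(subspace_error(Qi, U) for Qi in Q_nodes))
```

In this sketch the number of consensus rounds per outer iteration controls the trade-off between communication cost and accuracy: a sparser topology such as the ring mixes more slowly and therefore needs more rounds than a densely connected network would to reach the same agreement among nodes.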