Distributed Principal Component Analysis (PCA) has been studied for settings in which data are stored across multiple machines and communication cost or privacy concerns prohibit computing PCA in a central location. However, the sub-Gaussian assumption in the related literature is restrictive in real applications, where outliers or heavy-tailed data are common in areas such as finance and macroeconomics. In this article, we propose a distributed algorithm for estimating the principal eigenspaces without any moment constraint on the underlying distribution. We study the problem under the elliptical family framework and adopt the sample multivariate Kendall's tau matrix to extract eigenspace estimators from all sub-machines, which can be viewed as points on the Grassmann manifold. We then find the "center" of these points as the final distributed estimator of the principal eigenspace. We investigate the bias and variance of the distributed estimator and derive its convergence rate, which depends on the effective rank and eigengap of the scatter matrix and on the number of sub-machines. We show that the distributed estimator performs as if we had full access to the whole dataset. Simulation studies show that the distributed algorithm performs comparably with the existing one for light-tailed data, while showing a great advantage for heavy-tailed data. We also extend our algorithm to the distributed learning of elliptical factor models and verify its empirical usefulness through a real application to a macroeconomic dataset.
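The pipeline described above can be illustrated with a minimal sketch, under stated assumptions. Each machine computes the sample multivariate Kendall's tau matrix of its local data and keeps the top-k eigenvectors; the local eigenspaces are then combined. The abstract does not specify how the "center" on the Grassmann manifold is computed, so the sketch assumes one common choice: averaging the local projection matrices and taking the top-k eigenvectors of the average. The function names (`kendall_tau_matrix`, `top_k_eigvecs`, `distributed_eigenspace`) and the toy data settings are hypothetical and not taken from the article.

```python
import numpy as np


def kendall_tau_matrix(X):
    """Sample multivariate Kendall's tau matrix of the rows of X (n x p).

    Averages the outer products of normalized pairwise differences
    (x_i - x_j) / ||x_i - x_j|| over all pairs i < j.
    """
    n = X.shape[0]
    i_idx, j_idx = np.triu_indices(n, k=1)
    diffs = X[i_idx] - X[j_idx]
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > 0  # drop tied observations, whose direction is undefined
    units = diffs[keep] / norms[keep, None]
    return units.T @ units / units.shape[0]


def top_k_eigvecs(S, k):
    """Orthonormal basis (p x k) of the leading k-dimensional eigenspace of S."""
    _, vecs = np.linalg.eigh(S)  # eigenvalues are returned in ascending order
    return vecs[:, ::-1][:, :k]


def distributed_eigenspace(blocks, k):
    """Combine local eigenspaces by averaging their projection matrices.

    Assumption: this averaging step stands in for the "center" of the local
    estimators; the article's exact aggregation rule may differ.
    """
    p = blocks[0].shape[1]
    mean_proj = np.zeros((p, p))
    for X in blocks:
        V = top_k_eigvecs(kendall_tau_matrix(X), k)
        mean_proj += V @ V.T  # only this p x p projection leaves the machine
    mean_proj /= len(blocks)
    return top_k_eigvecs(mean_proj, k)


# Toy example: multivariate t data with 2 degrees of freedom (infinite
# variance) and a scatter matrix whose two leading eigenvectors are e1, e2.
rng = np.random.default_rng(0)
p, k, m, n, df = 10, 2, 10, 200, 2
scales = np.sqrt(np.array([20.0, 10.0] + [1.0] * (p - 2)))
blocks = []
for _ in range(m):
    Z = rng.standard_normal((n, p)) * scales
    chi = rng.chisquare(df, size=n)
    blocks.append(Z / np.sqrt(chi / df)[:, None])

V_hat = distributed_eigenspace(blocks, k)
true_proj = np.diag([1.0] * k + [0.0] * (p - k))
print(np.linalg.norm(V_hat @ V_hat.T - true_proj))  # subspace estimation error
```

The Kendall's tau matrix uses only the directions of pairwise differences, which is why no moment of the data needs to exist. Averaging projection matrices rather than the bases themselves avoids the sign and rotation ambiguity of eigenvectors across machines.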