The growing size of modern data sets brings many challenges to existing statistical estimation approaches, which calls for new distributed methodologies. This paper studies distributed estimation for a fundamental statistical machine learning problem, principal component analysis (PCA). Despite the massive literature on top eigenvector estimation, much less attention has been paid to top-$L$-dim ($L>1$) eigenspace estimation, especially in a distributed setting. We propose a novel multi-round algorithm for constructing the top-$L$-dim eigenspace from distributed data. Our algorithm takes advantage of shift-and-invert preconditioning and convex optimization. The resulting estimator is communication-efficient and achieves a fast convergence rate. In contrast to the existing divide-and-conquer algorithm, our approach places no restriction on the number of machines. Theoretically, the traditional Davis-Kahan theorem requires an explicit eigengap assumption to estimate the top-$L$-dim eigenspace. To dispense with this eigengap assumption, we take a new route in our analysis: instead of exactly identifying the top-$L$-dim eigenspace, we show that our estimator is able to cover the targeted top-$L$-dim population eigenspace. Our distributed algorithm can be applied to a wide range of statistical problems based on PCA, such as principal component regression and the single index model. Finally, we provide simulation studies to demonstrate the performance of the proposed distributed estimator.
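To illustrate the core numerical idea the abstract alludes to, the following is a minimal single-machine sketch of shift-and-invert preconditioned subspace iteration for estimating a top-$L$-dim eigenspace. It is not the paper's distributed algorithm: the shift `sigma`, the toy spectrum, and the direct linear solve are all illustrative assumptions; in the paper's setting the inner solve is instead carried out by a communication-efficient convex optimization across machines.

```python
import numpy as np

def shift_invert_subspace(A, L, sigma, iters=50, seed=0):
    """Subspace iteration preconditioned by (sigma*I - A)^{-1}.

    With sigma slightly above the largest eigenvalue of A, the shifted
    inverse sharply amplifies the top-L eigen-directions, so the iterate
    converges much faster than plain power/subspace iteration.
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((n, L)))  # random orthonormal start
    M = sigma * np.eye(n) - A
    for _ in range(iters):
        # Illustrative direct solve; the distributed variant replaces this
        # step with a convex optimization solved across machines.
        V = np.linalg.solve(M, V)
        V, _ = np.linalg.qr(V)  # re-orthonormalize
    return V  # n x L orthonormal basis for the estimated eigenspace

# Toy check on a matrix with a known spectrum (eigengap at position L=2).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
eigs = np.array([5.0, 4.0, 1.0, 0.5, 0.2, 0.1])
A = Q @ np.diag(eigs) @ Q.T
V = shift_invert_subspace(A, L=2, sigma=5.1)  # sigma just above lambda_1
# Distance between projectors onto the estimated and true top-2 eigenspaces.
P_true = Q[:, :2] @ Q[:, :2].T
err = np.linalg.norm(V @ V.T - P_true)
print(err < 1e-8)
```

The projector distance `err` is the natural error metric here because an eigenspace basis is only identified up to rotation, which matches the abstract's focus on subspaces rather than individual eigenvectors.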