In this work, we develop a distributed least squares approximation (DLSA) method that can solve a large family of regression problems (e.g., linear regression, logistic regression, and Cox's model) on a distributed system. By approximating the local objective function with a local quadratic form, we obtain a combined estimator as a weighted average of the local estimators. The resulting estimator is proved to be statistically as efficient as the global estimator, and it requires only one round of communication. We further conduct shrinkage estimation based on the DLSA estimator using an adaptive Lasso approach. The solution can be easily obtained by using the LARS algorithm on the master node. It is theoretically shown that the resulting estimator possesses the oracle property and is selection consistent when used with a newly designed distributed Bayesian information criterion (DBIC). The finite sample performance and computational efficiency are further illustrated by an extensive numerical study and an airline dataset, which is 52 GB in size. The entire methodology has been implemented in Python for a de facto standard Spark system. On the Spark system, the proposed DLSA algorithm takes 26 minutes to obtain a logistic regression estimator, which is more efficient and memory friendly than conventional methods.
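The one-round weighted-average idea can be illustrated with a minimal sketch, assuming the simplest case of linear regression, where the local quadratic form is exact. The weights are assumed here to be the local matrices X_k'X_k (the Hessians of the local least squares objectives), and the names `local_fit` and `combine` are hypothetical helpers introduced for illustration; this is not the Spark implementation described above.

```python
import numpy as np

def local_fit(X, y):
    """Run on one worker: return the local OLS estimate and its weight X'X."""
    hessian = X.T @ X                       # assumed weight: local Hessian
    theta = np.linalg.solve(hessian, X.T @ y)
    return theta, hessian

def combine(local_results):
    """Run on the master: matrix-weighted average of the local estimates."""
    total_weight = sum(h for _, h in local_results)
    weighted_sum = sum(h @ t for t, h in local_results)
    return np.linalg.solve(total_weight, weighted_sum)

# Simulated data, split into K shards to mimic K workers.
rng = np.random.default_rng(0)
n, p, K = 6000, 5, 4
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ beta + rng.normal(size=n)

shards = zip(np.array_split(X, K), np.array_split(y, K))
local_results = [local_fit(Xk, yk) for Xk, yk in shards]  # one communication round

theta_combined = combine(local_results)
theta_global = np.linalg.solve(X.T @ X, X.T @ y)          # full-data OLS for comparison
```

In this linear case the combined estimator coincides with the full-data OLS estimator up to floating-point error, since the weighted sum of local estimates reduces to the sum of the local X_k'y_k terms. For models such as logistic regression the agreement would hold only approximately.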