Distributed learning provides an attractive framework for scaling the learning task by sharing the computational load over multiple nodes in a network. Here, we investigate the performance of distributed learning for large-scale linear regression where the model parameters, i.e., the unknowns, are distributed over the network. We adopt a statistical learning approach. In contrast to works that focus on the performance on the training data, we focus on the generalization error, i.e., the performance on unseen data. We provide high-probability bounds on the generalization error for isotropic Gaussian, correlated Gaussian, and sub-Gaussian data. These results reveal how the generalization performance depends on the partitioning of the model over the network. In particular, our results show that the generalization error of the distributed solution can be substantially higher than that of the centralized solution, even when the two approaches achieve the same level of error on the training data. Our numerical results illustrate the performance with both real-world image data and synthetic data.
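To make the setting concrete, the following is a minimal sketch of linear regression in which the unknowns are split column-wise over nodes. The abstract does not name the distributed algorithm, so the averaging block-update rule below, the function names, and the problem sizes are illustrative assumptions rather than the authors' method. The generalization error is evaluated under the additional assumption of isotropic Gaussian test inputs and noise-free targets, in which case it reduces to the squared distance between the estimated and true parameters.

```python
import numpy as np


def centralized_solution(A, y):
    """Minimum-norm least-squares solution using all columns of A at once."""
    return np.linalg.pinv(A) @ y


def distributed_solution(A, y, blocks, n_iters=200):
    """Assumed averaging scheme: node k only updates its own unknowns x[blocks[k]]."""
    x = np.zeros(A.shape[1])
    K = len(blocks)
    for _ in range(n_iters):
        residual = y - A @ x  # residual shared with all nodes each round
        step = np.zeros_like(x)
        for idx in blocks:
            # Local minimum-norm fit of the residual using this node's columns only.
            step[idx] = np.linalg.pinv(A[:, idx]) @ residual
        x = x + step / K  # average the K local proposals
    return x


def generalization_error(x_hat, x_true):
    """For test inputs a ~ N(0, I) and noise-free targets:
    E[(a^T x_hat - a^T x_true)^2] = ||x_hat - x_true||^2."""
    return float(np.sum((x_hat - x_true) ** 2))


def run_demo(n=30, p=60, K=2, seed=0):
    """Hypothetical synthetic experiment: n samples, p unknowns, K nodes."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, p))
    x_true = rng.standard_normal(p)
    y = A @ x_true  # noise-free training targets
    blocks = np.array_split(np.arange(p), K)  # partition the unknowns over nodes
    x_c = centralized_solution(A, y)
    x_d = distributed_solution(A, y, blocks)
    return {
        "norm_y": float(np.linalg.norm(y)),
        "train_centralized": float(np.linalg.norm(y - A @ x_c)),
        "train_distributed": float(np.linalg.norm(y - A @ x_d)),
        "gen_centralized": generalization_error(x_c, x_true),
        "gen_distributed": generalization_error(x_d, x_true),
        "x_distributed_shape": x_d.shape,
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(name, value)
```

Varying `K` (and hence the number of unknowns per node relative to `n`) in `run_demo` is one way to compare the training and generalization errors of the two solutions on synthetic data; the sketch is not meant to reproduce the reported bounds or experiments.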