Distributed statistical learning problems arise commonly when dealing with large datasets. In this setup, the dataset is partitioned across machines, which compute locally and communicate short messages; communication is often the bottleneck. In this paper, we study one-step and iterative weighted parameter averaging in statistical linear models under data parallelism. We perform linear regression on each machine, send the results to a central server, and take a weighted average of the parameters. Optionally, we iterate, sending the weighted average back and performing local ridge regressions centered at it. How does this compare to performing linear regression on the full data? Here we study the performance loss in estimation error, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training sample size. We characterize the performance loss of one-step weighted averaging, and also give results for iterative averaging. We also find that different problems are affected differently by the distributed framework: estimation error and confidence interval length increase substantially, while prediction error increases much less. We rely on recent results from random matrix theory, where we develop a new calculus of deterministic equivalents as a tool of broader interest.
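To make the procedure concrete, the following is a minimal sketch of the one-step and iterative schemes described above, written in Python with NumPy. The data splits, the averaging weights, the regularization parameter `lam`, and all function names are illustrative assumptions, not the paper's notation; the local centered ridge step solves argmin_b ||y_k - X_k b||^2 + lam ||b - center||^2 under these assumptions.

```python
import numpy as np

def local_ols(X, y):
    # Ordinary least squares on one machine's local data.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def local_centered_ridge(X, y, center, lam):
    # Ridge regression centered at `center` (assumed form):
    #   argmin_b ||y - X b||^2 + lam * ||b - center||^2
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y + lam * center)

def one_step_average(splits, weights):
    # One round: each machine fits OLS locally; the server
    # returns the weighted average of the local estimates.
    betas = [local_ols(X, y) for X, y in splits]
    return sum(w * b for w, b in zip(weights, betas))

def iterative_average(splits, weights, lam, n_iter):
    # Iterative variant: the current average is broadcast back and
    # each machine solves a local ridge problem centered at it,
    # after which the server re-averages.
    beta = one_step_average(splits, weights)
    for _ in range(n_iter):
        betas = [local_centered_ridge(X, y, beta, lam) for X, y in splits]
        beta = sum(w * b for w, b in zip(weights, betas))
    return beta
```

For example, with `splits = [(X1, y1), (X2, y2)]` and uniform weights `[0.5, 0.5]`, `one_step_average` communicates only the two local coefficient vectors, while `iterative_average` additionally broadcasts the running average once per iteration.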