The large communication cost of exchanging gradients between nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k and random-k shown to be particularly effective. The same observation has also motivated a separate line of work in distributed statistical estimation theory, focusing on the impact of communication constraints on the estimation efficiency of different statistical models. The primary goal of this paper is to connect these two lines of research and to demonstrate how statistical estimation models and their analysis can lead to new insights into the design of communication-efficient training techniques. We propose a simple statistical estimation model for the stochastic gradients that captures the sparsity and skewness of their distribution. The statistically optimal communication scheme arising from the analysis of this model leads to a new sparsification technique for SGD that concatenates random-k and top-k, two methods considered separately in the prior literature. Through extensive experiments on both image and language domains, using the CIFAR-10, ImageNet, and Penn Treebank datasets, we show that the concatenated application of these two sparsification methods consistently and significantly outperforms either method applied alone.
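To make the combined sparsifier concrete, a minimal sketch follows. The abstract does not fix the order of composition, so this sketch assumes the random-k stage runs first and top-k is then applied to the surviving coordinates; the function name rand_then_top_k and the PyTorch implementation are illustrative assumptions, not the authors' reference code.

    import torch

    def rand_then_top_k(grad: torch.Tensor, r: int, k: int) -> torch.Tensor:
        # Hypothetical sketch: the random-k stage keeps r randomly chosen
        # coordinates; the top-k stage then keeps the k largest-magnitude
        # coordinates among the survivors. Requires k <= r <= grad.numel().
        assert k <= r <= grad.numel()
        flat = grad.flatten()
        # Random-k stage: sample r coordinate indices without replacement.
        rand_idx = torch.randperm(flat.numel(), device=flat.device)[:r]
        # Top-k stage: among the surviving coordinates, keep the k entries
        # with the largest absolute value.
        top = torch.topk(flat[rand_idx].abs(), k).indices
        keep_idx = rand_idx[top]
        # Zero out everything except the k selected coordinates.
        sparse = torch.zeros_like(flat)
        sparse[keep_idx] = flat[keep_idx]
        return sparse.view_as(grad)

In a data-parallel setting, each worker would sparsify its local gradient this way and communicate only the k surviving (index, value) pairs before aggregation; error feedback, if used, is omitted from this sketch.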