Gradient compression is a widely established remedy for the communication bottleneck in distributed training of large deep neural networks (DNNs). Under the error-feedback framework, Top-$k$ sparsification, sometimes with $k$ as small as $0.1\%$ of the gradient size, enables training to the same model quality as the uncompressed case for a similar iteration count. From the optimization perspective, we find that Top-$k$ is the communication-optimal sparsifier given a per-iteration budget of $k$ elements. We argue that to further the benefits of gradient sparsification, especially for DNNs, a different perspective is necessary -- one that moves from per-iteration optimality to optimality over the entire training. We identify that the total error -- the sum of the compression errors over all iterations -- captures the effect of sparsification throughout training. We then propose a communication complexity model that minimizes the total error under a communication budget for the entire training. Under this model, we find that the hard-threshold sparsifier, a variant of the Top-$k$ sparsifier with $k$ determined by a constant hard threshold, is the optimal sparsifier. Motivated by this, we provide convex and non-convex convergence analyses for the hard-threshold sparsifier with error-feedback. Unlike the Top-$k$ sparsifier, hard-threshold has the same asymptotic convergence and linear speedup property as SGD in the convex case, and data heterogeneity has no impact on its convergence in the non-convex case. Our diverse experiments on various DNNs and a logistic regression model demonstrate that the hard-threshold sparsifier is more communication-efficient than Top-$k$.
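For illustration, here is a minimal NumPy sketch of the two sparsifiers under the error-feedback framework discussed above. The function and class names, and the threshold value `lam`, are our own illustrative choices, not the paper's implementation; how the threshold relates to the communication budget is left as an assumed input.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Top-k: keep the k largest-magnitude entries of grad; zero the rest."""
    out = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    out[idx] = grad[idx]
    return out

def hard_threshold_sparsify(grad, lam):
    """Hard-threshold: keep entries whose magnitude is at least lam,
    so the number of transmitted elements varies across iterations."""
    return np.where(np.abs(grad) >= lam, grad, 0.0)

class ErrorFeedbackWorker:
    """Error-feedback: re-inject the residual from previous steps into the
    gradient before compressing, so no component is discarded forever."""
    def __init__(self, dim, sparsifier):
        self.error = np.zeros(dim)          # locally accumulated compression error
        self.sparsifier = sparsifier

    def compress(self, grad):
        corrected = grad + self.error       # add back past compression error
        sent = self.sparsifier(corrected)   # sparse message actually communicated
        self.error = corrected - sent       # store the new residual locally
        return sent
```

In this sketch, the sum over iterations of `||corrected - sent||^2` is the total error the abstract refers to; Top-$k$ minimizes it per iteration for a fixed $k$, whereas the hard-threshold variant minimizes it over the whole run for a fixed total communication budget.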