A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent work proposes gradient and model compression methods. In this work, we evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD across more than 200 different setups. Surprisingly, we observe that in only 6 of the more than 200 setups do gradient compression methods provide a speedup over optimized synchronous data-parallel training in the typical data-center setting. We conduct an extensive investigation to identify the root causes of this phenomenon and offer a performance model that can be used to identify the benefits of gradient compression for a variety of system setups. Based on our analysis, we propose a list of desirable properties that gradient compression methods should satisfy in order to provide a meaningful end-to-end speedup.
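To make the underlying trade-off concrete, the following is a minimal sketch of such a per-iteration performance model in Python. It assumes a simple additive cost decomposition; all names (t_compute, grad_bytes, bandwidth, ratio, t_encode, t_decode) are hypothetical illustrations rather than the paper's actual model, and the sketch ignores the communication/computation overlap that optimized synchronous implementations exploit.

    # Minimal sketch of a per-iteration performance model (hypothetical names;
    # assumes an additive cost model and no compute/communication overlap).

    def step_time_baseline(t_compute, grad_bytes, bandwidth):
        # Synchronous data-parallel SGD: compute plus a full-gradient all-reduce.
        return t_compute + grad_bytes / bandwidth

    def step_time_compressed(t_compute, grad_bytes, bandwidth,
                             ratio, t_encode, t_decode):
        # Gradient compression: communication volume shrinks by `ratio`,
        # but per-step encode/decode overhead is added.
        return t_compute + t_encode + (grad_bytes * ratio) / bandwidth + t_decode

Under this model, compression wins only when the communication time it saves exceeds its overhead, i.e. when grad_bytes * (1 - ratio) / bandwidth > t_encode + t_decode. On fast data-center interconnects the bandwidth term is large and the left-hand side is small, which is consistent with compression rarely providing a speedup in that setting.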