Rapid growth in the size of data sets and neural network architectures has rendered distributed training a necessity. A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, the machine learning community has largely focused on developing gradient and model compression methods. In parallel, the systems community has adopted several High Performance Computing (HPC) techniques to speed up distributed training. In this work, we evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD. Surprisingly, we observe that, due to the computation overheads introduced by gradient compression, the net speedup over vanilla data-parallel training is marginal, if not negative. We conduct an extensive investigation to identify the root causes of this phenomenon and offer a performance model that can be used to identify the benefits of gradient compression for a variety of system setups. Based on our analysis, we propose a list of desirable properties that gradient compression methods should satisfy in order to provide a meaningful end-to-end speedup.
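To make the trade-off described above concrete, the sketch below gives a minimal back-of-envelope performance model comparing per-iteration time of vanilla synchronous data-parallel SGD against a gradient-compression variant. It is an illustrative assumption, not the paper's actual model; all parameter names (compute_time, grad_bytes, bandwidth, compression_ratio, encode_decode_time) are hypothetical placeholders.

```python
# Illustrative sketch only: a simple cost model where communication is
# bandwidth-bound and compression reduces traffic but adds encode/decode time.

def iteration_time_vanilla(compute_time, grad_bytes, bandwidth):
    """Forward/backward compute plus a full-gradient all-reduce."""
    comm_time = grad_bytes / bandwidth
    return compute_time + comm_time

def iteration_time_compressed(compute_time, grad_bytes, bandwidth,
                              compression_ratio, encode_decode_time):
    """Compute plus compression overhead plus reduced-volume communication."""
    comm_time = (grad_bytes / compression_ratio) / bandwidth
    return compute_time + encode_decode_time + comm_time

if __name__ == "__main__":
    # Hypothetical numbers: 0.1 s compute, 400 MB of gradients,
    # 10 GB/s effective bandwidth, 100x compression costing 0.05 s to encode/decode.
    t_vanilla = iteration_time_vanilla(0.1, 400e6, 10e9)
    t_comp = iteration_time_compressed(0.1, 400e6, 10e9, 100, 0.05)
    print(f"vanilla: {t_vanilla:.3f} s/iter, compressed: {t_comp:.3f} s/iter")
    # On fast interconnects the encode/decode overhead can outweigh the
    # communication savings, yielding the marginal or negative net speedup
    # that the abstract describes.
```

Under these assumed numbers the compressed variant is barely faster (0.136 s vs. 0.140 s per iteration), illustrating how quickly compression overhead can erase the communication savings on well-provisioned networks.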