The recent many-fold increase in the size of deep neural networks makes efficient distributed training challenging. Many proposals exploit the compressibility of the gradients and propose lossy compression techniques to speed up the communication stage of distributed training. Nevertheless, compression comes at the cost of reduced model quality and extra computation overhead. In this work, we design an efficient compressor with minimal overhead. Noting the sparsity of the gradients, we propose to model the gradients as random variables distributed according to some sparsity-inducing distributions (SIDs). We empirically validate our assumption by studying the statistical characteristics of the evolution of gradient vectors over the training process. We then propose Sparsity-Inducing Distribution-based Compression (SIDCo), a threshold-based sparsification scheme that enjoys similar threshold-estimation quality to deep gradient compression (DGC) while being faster, owing to its lower compression overhead. Our extensive evaluation on popular machine learning benchmarks involving both recurrent neural network (RNN) and convolutional neural network (CNN) models shows that SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Top-k, and DGC compressors, respectively.
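To make the threshold-based idea concrete, the sketch below shows one way a sparsification threshold can be derived analytically from a fitted sparsity-inducing distribution instead of by sorting the full gradient as Top-k-style selection does. It is a minimal illustration assuming a single-stage exponential fit to the gradient magnitudes; the function name `threshold_sparsify` and the `ratio` parameter are ours for illustration and do not reproduce the authors' exact multi-stage estimator.

```python
import math
import torch

def threshold_sparsify(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the entries of `grad` whose magnitude exceeds a threshold
    estimated from an exponential model of the magnitudes (illustrative sketch,
    not the authors' exact procedure)."""
    g = grad.flatten()
    abs_g = g.abs()
    # Maximum-likelihood fit of an exponential distribution to |g|:
    # the scale parameter is simply mean(|g|), computed in O(n).
    scale = abs_g.mean()
    # Under this model, P(|g| > t) = exp(-t / scale); solve for t so that the
    # expected fraction of surviving entries equals the target ratio.
    threshold = -scale * math.log(ratio)
    mask = abs_g > threshold
    indices = torch.nonzero(mask).flatten()
    values = g[indices]
    return values, indices

# Example: a synthetic Laplace-distributed gradient, compressed to ~1% of its
# entries (the exponential model is exact for the magnitudes of Laplace noise).
grad = torch.distributions.Laplace(0.0, 1.0).sample((100_000,))
values, indices = threshold_sparsify(grad, ratio=0.01)
```

The key point is that the threshold comes from the fitted model's tail (its quantile function) in linear time, without the sorting or iterative threshold search that Top-k and DGC rely on; the paper's actual scheme refines this idea with multiple fitting stages and a choice among several SIDs.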