Data-parallel distributed training of deep neural networks (DNNs) has gained widespread adoption, but can still suffer from communication bottlenecks due to gradient transmission. To address this issue, entire families of lossy gradient compression mechanisms have been developed, including quantization, sparsification, and low-rank approximation, some of which are seeing significant practical adoption. Despite this progress, almost all known compression schemes apply compression uniformly across DNN layers, even though layers are heterogeneous in terms of parameter count and impact on model accuracy. In this work, we provide a general framework for adapting the degree of compression across the model's layers dynamically during training, significantly improving overall compression without sacrificing accuracy. Our framework, called L-GreCo, is based on an efficient adaptive algorithm that automatically picks the optimal compression parameters for each layer, guaranteeing the best compression ratio while respecting a theoretically-justified error constraint. Our extensive experimental study over image classification and language modeling tasks shows that L-GreCo is effective across all three compression families, achieving up to 2.5$\times$ training speedup and up to 5$\times$ compression improvement over efficient implementations of standard approaches while recovering full accuracy. Moreover, we show that L-GreCo is complementary to existing adaptive algorithms, improving their compression ratio by 50% and practical throughput by 66%.
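The abstract describes L-GreCo's core step as choosing per-layer compression parameters that maximize overall compression while keeping the accumulated compression error under a global constraint. The sketch below illustrates one way such a selection could be posed, as a knapsack-style dynamic program over a discretized error budget; the function name `choose_layer_params`, the `(error, size)` candidate lists, and the discretization are illustrative assumptions, not the paper's actual algorithm or API.

```python
import math
from typing import List, Tuple


def choose_layer_params(
    candidates: List[List[Tuple[float, float]]],  # per layer: [(error, size), ...]
    error_budget: float,
    resolution: int = 1000,
) -> List[int]:
    """Pick one candidate per layer, minimizing total compressed size while
    keeping the summed (discretized) compression error within the budget."""
    INF = float("inf")
    step = error_budget / resolution
    # dp[b]: minimal total size achievable with total error <= b * step
    dp = [0.0] * (resolution + 1)
    choices = []  # choices[l][b]: candidate index picked for layer l at budget b

    for layer in candidates:
        new_dp = [INF] * (resolution + 1)
        layer_choice = [-1] * (resolution + 1)
        for b in range(resolution + 1):
            for idx, (err, size) in enumerate(layer):
                cost = math.ceil(err / step)  # conservative discretization
                if cost <= b and dp[b - cost] + size < new_dp[b]:
                    new_dp[b] = dp[b - cost] + size
                    layer_choice[b] = idx
        dp = new_dp
        choices.append(layer_choice)

    if dp[resolution] == INF:
        raise ValueError("Error budget too small for any feasible assignment.")

    # Backtrack from the full budget to recover the per-layer selections.
    picks, b = [], resolution
    for layer, layer_choice in zip(reversed(candidates), reversed(choices)):
        idx = layer_choice[b]
        picks.append(idx)
        b -= math.ceil(layer[idx][0] / step)
    return list(reversed(picks))
```

In this hypothetical setup, the per-layer candidates could be, for instance, different quantization bit-widths, sparsity levels, or ranks, each tagged with an estimated compression error and compressed size; the dynamic program runs in time proportional to (number of layers) x (budget resolution) x (candidates per layer).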