Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models, and lossy compression of exchanged information, such as stochastic gradients or models, is one of the most effective instruments to alleviate this issue. Among the most studied compression techniques is the class of unbiased compression operators with variance bounded by a multiple of the square norm of the vector we wish to compress. By design, this variance may remain high, and diminishes only if the input vector approaches zero. However, unless the model being trained is overparameterized, there is no a priori reason for the vectors we wish to compress to approach zero during the iterations of classical methods such as distributed compressed {\sf SGD}, which has adverse effects on the convergence speed. To circumvent this issue, several more elaborate and seemingly very different algorithms have been proposed recently. These methods are based on the idea of compressing the {\em difference} between the vector we would normally wish to compress and some auxiliary vector which changes throughout the iterative process. In this work we take a step back and develop a unified framework for studying such methods, both conceptually and theoretically. Our framework incorporates methods compressing both gradients and models, using unbiased and biased compressors, and sheds light on the construction of the auxiliary vectors. Furthermore, our general framework can lead to the improvement of several existing algorithms, and can produce new algorithms. Finally, we perform several numerical experiments which illustrate and support our theoretical findings.
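To make the two ideas in the abstract concrete, here is a minimal, hedged sketch (not the paper's actual algorithm or notation): a standard unbiased rand-$k$ sparsifier, whose variance is bounded by $(d/k - 1)\|x\|^2$ and thus does not vanish unless $x \to 0$, followed by the shifted variant that compresses the difference $x - h$ against an auxiliary shift $h$. The names `rand_k`, `shifted_compress`, and the shift stepsize `alpha` are illustrative assumptions, not identifiers from the paper.

```python
import random

def rand_k(x, k):
    # Unbiased rand-k sparsification: keep k of the d coordinates,
    # chosen uniformly at random, scaled by d/k so that E[C(x)] = x.
    # The variance satisfies E||C(x) - x||^2 <= (d/k - 1) * ||x||^2,
    # i.e. it is proportional to ||x||^2 and need not shrink.
    d = len(x)
    kept = set(random.sample(range(d), k))
    return [x[i] * d / k if i in kept else 0.0 for i in range(d)]

def shifted_compress(x, h, k, alpha=0.5):
    # Shifted compression: compress the difference x - h rather than
    # x itself. If the shift h tracks x across iterations, the vector
    # being compressed approaches zero, and so does the compression
    # variance. alpha is an illustrative shift learning rate.
    diff = [xi - hi for xi, hi in zip(x, h)]
    c = rand_k(diff, k)
    x_hat = [hi + ci for hi, ci in zip(h, c)]          # receiver's unbiased estimate of x
    h_new = [hi + alpha * ci for hi, ci in zip(h, c)]  # move the shift toward x
    return x_hat, h_new
```

In a distributed method, each worker would maintain its own shift `h`, send only the compressed difference `c`, and both sides would update `h` identically, so no extra communication is needed for the shift itself.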