Lossy gradient compression has become a practical tool for overcoming the communication bottleneck in centrally coordinated distributed training of machine learning models. However, algorithms for decentralized training with compressed communication over arbitrary connected networks have been more complicated, requiring additional memory and hyperparameters. We introduce a simple algorithm that directly compresses the model differences between neighboring workers using low-rank linear compressors. Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit. We prove that our method requires no additional hyperparameters, converges faster than prior methods, and is asymptotically independent of both the network and the compression. Out of the box, these compressors perform on par with state-of-the-art tuned compression algorithms in a series of deep learning benchmarks.
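To make the core idea concrete, here is a minimal sketch (not the authors' exact algorithm) of the kind of low-rank linear compressor the abstract describes: a single power-iteration step that approximates a model-difference matrix by a rank-1 outer product, so two small vectors are transmitted instead of the full matrix. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def rank1_compress(diff, q):
    """One power-iteration step, PowerSGD-style (illustrative sketch).

    Returns a left factor `p` and refreshed right factor `q_new` whose
    outer product approximates `diff`. Warm-starting `q` from the
    previous round is what makes repeated steps converge toward the
    best rank-1 approximation.
    """
    p = diff @ q                        # left factor, shape (n,)
    p /= np.linalg.norm(p) + 1e-12      # normalize for numerical stability
    q_new = diff.T @ p                  # right factor, shape (m,)
    return p, q_new

rng = np.random.default_rng(0)
diff = rng.standard_normal((8, 6))      # stand-in for a model difference
q = rng.standard_normal(6)              # warm start (random on first round)
p, q = rank1_compress(diff, q)
approx = np.outer(p, q)                 # rank-1 reconstruction at the receiver
# communicating p and q costs 8 + 6 floats instead of 8 * 6
```

Because each round reuses the previous round's factor as the starting point, a single inexpensive step per communication round is enough to track the dominant direction of the difference, which is the "maximize information transferred per bit" intuition from the abstract.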