Communication overhead severely hinders the scalability of distributed machine learning systems. Recently, there has been growing interest in using gradient compression to reduce the communication overhead of distributed training. However, little is understood about applying gradient compression to adaptive gradient methods. Moreover, its performance benefits are often limited by the non-negligible compression overhead. In this paper, we first introduce a novel adaptive gradient method with gradient compression. We show that the proposed method has a convergence rate of $\mathcal{O}(1/\sqrt{T})$ for non-convex problems. In addition, we develop a scalable system called BytePS-Compress for two-way compression, where gradients are compressed in both directions between workers and parameter servers. BytePS-Compress pipelines compression and decompression on CPUs and achieves a high degree of parallelism. Empirical evaluations show that we improve the training time of ResNet50, VGG16, and BERT-base by 5.0%, 58.1%, and 23.3%, respectively, with 25 Gb/s networking and no loss of accuracy. Furthermore, for training the BERT models, we achieve a compression rate of 333x compared to mixed-precision training.
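The abstract does not specify the exact compressor or update rule, so the following is only a minimal NumPy sketch of the general idea of combining gradient compression with an adaptive gradient step: each worker compresses its gradient with error feedback before communication, and the adaptive update is applied to the aggregated, decompressed gradient. The compressor (`compress_signsgd`), the Adam-style step, and all names here are illustrative assumptions, not the paper's algorithm or the BytePS-Compress API.

```python
import numpy as np

def compress_signsgd(g):
    """Illustrative 1-bit sign compressor with a per-tensor scale (assumed, not from the paper)."""
    scale = np.abs(g).mean()
    return np.sign(g), scale

def decompress(sign, scale):
    """Reconstruct an approximate gradient from the compressed form."""
    return sign * scale

class CompressedAdaptiveWorker:
    """Sketch of one worker: error-feedback compression plus an Adam-style adaptive update."""

    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(dim)    # first-moment estimate
        self.v = np.zeros(dim)    # second-moment estimate
        self.err = np.zeros(dim)  # error-feedback memory for the compressor
        self.t = 0

    def local_compress(self, grad):
        # Add the residual left over from the previous round, then compress.
        corrected = grad + self.err
        sign, scale = compress_signsgd(corrected)
        self.err = corrected - decompress(sign, scale)  # keep what the compressor dropped
        return sign, scale

    def apply(self, params, aggregated_grad):
        # Adam-style step on the (decompressed) gradient aggregated by the server.
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * aggregated_grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * aggregated_grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Usage sketch: one worker, one step, on a toy gradient.
worker = CompressedAdaptiveWorker(dim=4)
params = np.zeros(4)
sign, scale = worker.local_compress(np.array([0.3, -0.1, 0.05, -0.4]))
params = worker.apply(params, decompress(sign, scale))
```

In a two-way setting as described in the abstract, the server would apply the same kind of compression to the aggregated gradient before sending it back to workers; that direction is omitted here for brevity.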