Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck, and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: natural compression (NC). Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa. We show that, compared to no compression, NC increases the second moment of the compressed vector by no more than the tiny factor $\frac{9}{8}$, which means that the effect of NC on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communication savings enabled by NC are substantial, leading to a $3$-$4\times$ improvement in overall theoretical running time. For applications requiring more aggressive compression, we generalize NC to natural dithering, which we prove is exponentially better than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and they offer a new state of the art both in theory and in practice.
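
As a concrete illustration of the entrywise randomized rounding described above, the following is a minimal NumPy sketch (not taken from the paper's code). It rounds each entry of the update vector, in an unbiased way, either down to $2^{\lfloor \log_2 |t| \rfloor}$ or up to the next power of two; the function name `natural_compression` is our own, and we use `np.log2` for readability rather than the cheaper bit-level trick of ignoring the mantissa in the floating-point representation.

```python
import numpy as np

def natural_compression(x, rng=np.random.default_rng()):
    """Sketch of unbiased randomized rounding of each entry to a power of two.

    For a nonzero entry t with 2^a <= |t| < 2^(a+1), the magnitude is rounded
    down to 2^a with probability 2 - |t|/2^a and up to 2^(a+1) otherwise,
    so that E[output] = t (the compression is unbiased). Zeros are kept as is.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    nonzero = x != 0
    t = np.abs(x[nonzero])
    a = np.floor(np.log2(t))          # exponent of the lower power of two
    low = 2.0 ** a                    # 2^a <= |t| < 2^(a+1)
    p_up = t / low - 1.0              # probability of rounding up, in [0, 1)
    round_up = rng.random(t.shape) < p_up
    out[nonzero] = np.sign(x[nonzero]) * low * np.where(round_up, 2.0, 1.0)
    return out
```

A quick way to sanity-check the unbiasedness is to average many compressed copies of the same vector, e.g. `np.mean([natural_compression(g) for _ in range(10000)], axis=0)`, which should be close to `g`.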