Distributed machine learning has become an indispensable tool for training large supervised machine learning models. To address the high communication costs of distributed training, which are further exacerbated by the fact that modern high-performing models are typically overparameterized, a large body of work has been devoted in recent years to the design of compression strategies, such as sparsification and quantization, and of optimization algorithms capable of using them. Recently, Safaryan et al. (2021) pioneered a dramatically different approach to compression design: they first use the local training data to form local {\em smoothness matrices}, and then propose to design a compressor capable of exploiting the smoothness information contained therein. While this novel approach leads to substantial savings in communication, it is limited to sparsification, as it crucially depends on the linearity of the compression operator. In this work, we remove this limitation by extending their smoothness-aware compression strategy to arbitrary unbiased compression operators, a class that also includes sparsification. Specializing our results to quantization, we observe significant savings in communication complexity compared to standard quantization. In particular, we show theoretically that block quantization with $n$ blocks outperforms single-block quantization, reducing communication complexity by an $\mathcal{O}(n)$ factor, where $n$ is the number of nodes in the distributed system. Finally, we provide extensive numerical evidence that our smoothness-aware quantization strategies outperform existing quantization schemes as well as the aforementioned smoothness-aware sparsification strategies with respect to all relevant success measures: the number of iterations, the total number of bits communicated, and wall-clock time.
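For readers unfamiliar with the terminology, the following is a minimal Python sketch of what "unbiased compression operator" and "block quantization" can mean in practice. It is an illustration under stated assumptions, not the smoothness-aware scheme of this work: it uses rand-$k$ sparsification and standard random dithering as generic examples, reads "block quantization" as quantizing each block with its own norm, and all function names (`rand_k`, `dither_quantize`, `block_quantize`) are hypothetical.

```python
import numpy as np


def rand_k(x, k, rng):
    """Rand-k sparsification: keep k random coordinates, rescaled by d/k.

    The rescaling makes the operator unbiased: E[rand_k(x)] = x.
    """
    x = np.asarray(x, dtype=float)
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out


def dither_quantize(x, levels, rng):
    """Random dithering onto `levels` uniform levels of [0, ||x||_2].

    Each magnitude is rounded up or down at random, with the probability
    of rounding up equal to its fractional part, so E[Q(x)] = x.
    """
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) / norm * levels        # values in [0, levels]
    lower = np.floor(scaled)
    round_up = rng.random(x.shape) < (scaled - lower)
    return np.sign(x) * norm * (lower + round_up) / levels


def block_quantize(x, n_blocks, levels, rng):
    """Assumed reading of block quantization: split x into n_blocks
    contiguous blocks and dither each block with its own norm."""
    blocks = np.array_split(np.asarray(x, dtype=float), n_blocks)
    return np.concatenate([dither_quantize(b, levels, rng) for b in blocks])
```

Both operators are unbiased, but only `rand_k` is linear in `x` for a fixed random choice of coordinates; `dither_quantize` is not, which is the kind of operator the linearity requirement mentioned above rules out. Calling `block_quantize(x, 1, levels, rng)` recovers single-block quantization.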