Communication is an increasingly significant bottleneck in decentralized optimization. Bigger models and growing datasets mean that decentralizing computation is important and that the amount of information exchanged is growing quickly. While compression techniques have been introduced to cope with the latter, none has considered leveraging the temporal correlation that exists between consecutive vector updates. An important example is distributed momentum SGD, where temporal correlation is enhanced by the low-pass filtering effect of applying momentum. In this paper we design and analyze compression methods that exploit temporal correlation in systems both with and without error-feedback. Experiments with the ImageNet dataset demonstrate that our proposed methods offer a significant reduction in communication rate at only a negligible increase in computational complexity. We further analyze the convergence of SGD when compression is applied with error-feedback. In the literature, convergence guarantees are developed only for compressors that provide error bounds point-wise, i.e., for each input to the compressor. In contrast, many important codes (e.g., rate-distortion codes) provide error bounds only in expectation, which is a more general guarantee. In this paper we prove the convergence of SGD under an expected-error assumption by establishing a bound on the minimum gradient norm.
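For reference, the low-pass filtering effect of momentum and a standard error-feedback mechanism can be sketched as follows; the notation ($g_t$ the stochastic gradient, $\beta$ the momentum coefficient, $\gamma$ the learning rate, $\mathcal{C}$ the compressor, $e_t$ the accumulated compression error) is introduced here for illustration and is a generic formulation, not necessarily the exact scheme of the paper:

\[
m_t = \beta\, m_{t-1} + g_t = \sum_{k=0}^{t} \beta^{\,t-k} g_k,
\]
\[
\Delta_t = \mathcal{C}\!\left(\gamma\, m_t + e_t\right), \qquad e_{t+1} = \gamma\, m_t + e_t - \Delta_t, \qquad x_{t+1} = x_t - \Delta_t.
\]

The geometric weighting of past gradients makes consecutive momentum vectors $m_t$ highly correlated, which is the temporal correlation the proposed compressors exploit; the error term $e_t$ re-injects whatever the compressor discarded into later updates.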