Gradient compression is a popular technique for improving the communication complexity of stochastic first-order methods in distributed training of machine learning models. However, existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well known in practice, and has recently been confirmed in theory, that stochastic methods based on without-replacement sampling, e.g., the Random Reshuffling (RR) method, perform better than methods that sample the gradients with replacement. In this work, we close this gap in the literature and provide the first analysis of methods with gradient compression and without-replacement sampling. We first develop a distributed variant of random reshuffling with gradient compression (Q-RR) and show how to reduce the variance coming from gradient quantization through the use of control iterates. Next, to better fit Federated Learning applications, we incorporate local computation and propose a variant of Q-RR called Q-NASTYA. Q-NASTYA uses local gradient steps and different local and global stepsizes. We then show how to reduce compression variance in this setting as well. Finally, we prove convergence results for the proposed methods and outline several settings in which they improve upon existing algorithms.
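To illustrate the kind of update the abstract refers to, the following is a minimal sketch of one Q-RR-style step in our own notation (not taken verbatim from the algorithm statement): $M$ workers, an unbiased compression operator $\mathcal{Q}$, a stepsize $\gamma$, and a fresh permutation $\pi_m$ of worker $m$'s $n$ local data points drawn at the start of each epoch $t$:

\[
x_t^{i+1} = x_t^i - \gamma \cdot \frac{1}{M}\sum_{m=1}^{M} \mathcal{Q}\!\left(\nabla f^{(m)}_{\pi_m^i}(x_t^i)\right), \qquad i = 0, 1, \dots, n-1,
\]

where $f^{(m)}_j$ denotes the $j$-th local loss on worker $m$. Each worker thus takes exactly one compressed pass over its data per epoch instead of sampling indices with replacement.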