Gradient compression is a popular technique for improving the communication complexity of stochastic first-order methods in distributed training of machine learning models. However, existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well known in practice, and was recently confirmed in theory, that stochastic methods based on without-replacement sampling, e.g., the Random Reshuffling (RR) method, perform better than methods that sample the gradients with replacement. In this work, we close this gap in the literature and provide the first analysis of methods that combine gradient compression with without-replacement sampling. We first develop a na\"ive combination of random reshuffling with gradient compression (Q-RR). Perhaps surprisingly, the theoretical analysis of Q-RR does not show any benefit of using RR; this is due to the additional compression variance. Our extensive numerical experiments confirm this phenomenon. To reveal the true advantages of RR in distributed learning with compression, we propose a new method called DIANA-RR that reduces the compression variance and has provably better convergence rates than existing counterparts based on with-replacement sampling of stochastic gradients. Next, to better fit Federated Learning applications, we incorporate local computation: we propose and analyze Q-NASTYA and DIANA-NASTYA, variants of Q-RR and DIANA-RR that use local gradient steps and different local and global stepsizes. Finally, we conduct several numerical experiments to illustrate our theoretical results.
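To make the na\"ive combination concrete, one natural per-iteration update for Q-RR reads as follows (a sketch under assumed notation, with $M$ workers, stepsize $\gamma$, an unbiased compression operator $\mathcal{Q}$ such as Rand-$K$, and local permutations $\pi_m$ redrawn at the start of every epoch; the paper's exact formulation may differ):
\[
    x_t^{i+1} \;=\; x_t^i \;-\; \gamma \cdot \frac{1}{M} \sum_{m=1}^{M} \mathcal{Q}\!\left( \nabla f_{m,\,\pi_m^i}\!\left(x_t^i\right) \right),
\]
where $f_{m,j}$ denotes the $j$-th local loss on worker $m$, so each worker processes its data without replacement within an epoch and compresses the resulting stochastic gradient before communication. In the spirit of DIANA, the shift-based variant would instead compress differences between these gradients and learned shift vectors, which is the mechanism by which the compression variance is reduced.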