A major obstacle to achieving global convergence in distributed and federated learning is the misalignment of gradients across clients or mini-batches, caused by the heterogeneity and stochasticity of the distributed data. In this work, we show that data heterogeneity can in fact be exploited to improve generalization performance through implicit regularization. One way to alleviate the effects of heterogeneity is to encourage the alignment of gradients across different clients throughout training. Our analysis reveals that this goal can be accomplished by choosing an optimization method that replicates the implicit regularization effect of SGD, leading to gradient alignment as well as improved test accuracy. Since this regularization in SGD arises entirely from the sequential use of different mini-batches during training, it is inherently absent when training with large mini-batches. To obtain the generalization benefits of this regularization while increasing parallelism, we propose GradAlign, a novel algorithm that induces the same implicit regularization while allowing the use of arbitrarily large batches in each update. We experimentally validate the benefits of our algorithm in different distributed and federated learning settings.
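As a point of reference, the following is a hedged sketch (not the paper's own derivation or notation) of the kind of implicit regularizer the abstract refers to: backward-error analyses of SGD suggest that one epoch of sequential updates with step size $\eta$ over $m$ mini-batch losses $L_1,\dots,L_m$, with average loss $L = \frac{1}{m}\sum_k L_k$ and parameters $w$, approximately follows gradient flow on a modified loss whose extra term, relative to full-batch gradient descent, penalizes the spread of per-batch gradients and thus rewards gradient alignment:
\[
\tilde{L}(w) \;\approx\; L(w) + \frac{\eta}{4m}\sum_{k=1}^{m}\bigl\|\nabla L_k(w)\bigr\|^{2}
\;=\; L(w) + \frac{\eta}{4}\bigl\|\nabla L(w)\bigr\|^{2} + \frac{\eta}{4m}\sum_{k=1}^{m}\bigl\|\nabla L_k(w)-\nabla L(w)\bigr\|^{2},
\]
where the last term vanishes exactly when all per-batch (or per-client) gradients coincide. A single update on the full batch contributes only the $\frac{\eta}{4}\|\nabla L(w)\|^{2}$ term, which is why the alignment-promoting component is absent in large-batch training.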