Local SGD is a communication-efficient variant of SGD for large-scale training, in which multiple GPUs run SGD independently and average the model parameters periodically. It has recently been observed that Local SGD not only achieves its design goal of reducing communication overhead but can also attain higher test accuracy than the corresponding SGD baseline (Lin et al., 2020b), although the training regimes in which this happens are still under debate (Ortiz et al., 2021). This paper aims to understand why (and when) Local SGD generalizes better, based on a Stochastic Differential Equation (SDE) approximation. The main contributions of this paper are (i) the derivation of an SDE that captures the long-term behavior of Local SGD in the small-learning-rate regime, showing how noise drives the iterate to drift and diffuse after it has come close to the manifold of local minima, (ii) a comparison between the SDEs of Local SGD and SGD, showing that Local SGD induces a stronger drift term, which can produce a stronger regularization effect, e.g., a faster reduction of sharpness, and (iii) empirical evidence that a small learning rate and a sufficiently long training time are what enable the generalization improvement over SGD, and that removing either condition eliminates the improvement.
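The following is a minimal sketch of the Local SGD update pattern described above (independent local SGD steps followed by periodic parameter averaging), shown on a toy quadratic objective with synthetic gradient noise. The worker count, number of local steps, learning rate, and loss function are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def stochastic_grad(x, rng, noise_std=0.1):
    # Gradient of the toy loss f(x) = 0.5 * ||x||^2, plus Gaussian noise
    # standing in for minibatch gradient noise (illustrative only).
    return x + noise_std * rng.standard_normal(x.shape)

def local_sgd(x0, lr=0.01, num_rounds=100, local_steps=8, num_workers=4, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(num_rounds):
        # Each worker runs `local_steps` SGD updates independently,
        # starting from the current shared parameters.
        local_models = []
        for _ in range(num_workers):
            xk = x.copy()
            for _ in range(local_steps):
                xk -= lr * stochastic_grad(xk, rng)
            local_models.append(xk)
        # Communication step: average the model parameters across workers.
        x = np.mean(local_models, axis=0)
    return x

if __name__ == "__main__":
    x_final = local_sgd(np.ones(10))
    print("final parameter norm:", np.linalg.norm(x_final))
```

With `local_steps=1`, each round reduces to a single synchronous step on the averaged gradient; larger values of `local_steps` reduce communication frequency and, per the analysis in this paper, change the effective drift of the resulting dynamics.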