It is generally recognized that a finite learning rate (LR), in contrast to an infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Itô Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., Li et al., 2019) only applies to SGD with tiny LR. Experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm, SVAG, that provably converges to the conventionally used Itô SDE approximation. (b) A theoretically motivated, testable necessary condition for the SDE approximation and its most famous implication, the linear scaling rule (Goyal et al., 2017), to hold. (c) Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.
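For context, the Itô SDE conventionally used to approximate SGD (as in Li et al., 2019) takes the following standard form, where L is the loss, \eta the LR, \Sigma(X_t) the covariance of the minibatch gradient noise, and W_t a Wiener process; this is the usual presentation in the SDE-approximation literature, not necessarily this paper's exact notation:

    dX_t = -\nabla L(X_t)\,dt + \big(\eta\,\Sigma(X_t)\big)^{1/2}\,dW_t .

Since averaging over a minibatch of size B scales the noise covariance as \Sigma \propto 1/B, the diffusion term depends on \eta and B only through the ratio \eta/B; holding \eta/B fixed while scaling B is exactly the linear scaling rule of Goyal et al. (2017).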
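To make contribution (a) concrete, below is a minimal PyTorch-style sketch of an SVAG-type step. It assumes a variance-amplified update that combines gradients from two independent minibatches with coefficients (1 ± sqrt(2l - 1))/2 and takes a step of size lr/l, so the expected gradient is preserved while its variance is amplified by l; the interface (model, loss_fn, batch1, batch2) is hypothetical, and the paper should be consulted for the exact algorithm.

    import math
    import torch

    def svag_step(model, loss_fn, batch1, batch2, lr, l):
        # Coefficients (1 ± sqrt(2l - 1)) / 2 keep the mean gradient
        # unchanged while amplifying its variance by l
        # (since a1**2 + a2**2 == l).
        r = math.sqrt(2 * l - 1)
        a1, a2 = (1 + r) / 2, (1 - r) / 2

        # Gradient on the first independent minibatch.
        model.zero_grad()
        loss_fn(model, batch1).backward()
        g1 = [p.grad.detach().clone() for p in model.parameters()]

        # Gradient on the second independent minibatch.
        model.zero_grad()
        loss_fn(model, batch2).backward()
        g2 = [p.grad.detach().clone() for p in model.parameters()]

        # Plain SGD step with the combined gradient and LR lr / l:
        # larger l means smaller steps with proportionally noisier
        # gradients, tracking the Ito SDE as l grows.
        with torch.no_grad():
            for p, u1, u2 in zip(model.parameters(), g1, g2):
                p.add_(a1 * u1 + a2 * u2, alpha=-lr / l)

Under these assumptions, covering the same continuous training time as T steps of ordinary SGD at LR lr requires roughly T * l SVAG steps at LR lr / l, which is what makes the simulation computationally feasible while approaching the SDE in the large-l limit.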