Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze mini-batch SGD for linear models across different momenta and batch sizes. Our key idea is to describe the loss value sequence in terms of its generating function, which can be written in a compact form under a diagonal approximation for the second moments of the model weights. By analyzing this generating function, we deduce various conclusions about the convergence conditions, the phase structure of the model, and the optimal learning settings. As a few examples, we show that 1) the optimization trajectory can generally switch from the "signal-dominated" to the "noise-dominated" phase at a time scale that can be predicted analytically; 2) in the "signal-dominated" (but not the "noise-dominated") phase it is favorable to choose a large effective learning rate, but its value must be limited for any finite batch size to avoid divergence; 3) the optimal convergence rate can be achieved at a negative momentum. We verify our theoretical predictions with extensive experiments on MNIST and synthetic problems, and find good quantitative agreement.
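To make the setting concrete, below is a minimal sketch of mini-batch SGD with heavy-ball momentum on a synthetic linear least-squares problem, including a negative momentum value as mentioned in point 3. The dimensions, learning rate, batch size, and momentum chosen here are illustrative assumptions only and do not reproduce the paper's analysis or experimental configuration.

```python
import numpy as np

# Minimal sketch (assumed setup): mini-batch SGD with heavy-ball momentum
# on a synthetic linear least-squares problem. All hyperparameters below
# are illustrative, not the paper's settings.

rng = np.random.default_rng(0)

n_samples, n_features = 2048, 64
X = rng.standard_normal((n_samples, n_features))
w_star = rng.standard_normal(n_features)
y = X @ w_star + 0.1 * rng.standard_normal(n_samples)  # noisy linear targets

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

batch_size = 32
lr = 0.05      # per-step learning rate (assumed value)
beta = -0.2    # momentum; negative values are admissible, cf. point 3 above
# In the heavy-ball parametrization the effective learning rate is lr / (1 - beta).

w = np.zeros(n_features)
v = np.zeros(n_features)

for step in range(2001):
    idx = rng.integers(0, n_samples, size=batch_size)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size           # mini-batch gradient
    v = beta * v - lr * grad                           # heavy-ball momentum update
    w = w + v
    if step % 500 == 0:
        print(f"step {step:5d}  loss {loss(w):.6f}")
```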