Recent work reports that increasing the learning rate or decreasing the minibatch size in stochastic gradient descent (SGD) can improve test set performance. We argue that this is expected under some conditions in models whose loss function has multiple local minima. Our main contribution is an approximate but analytical approach, inspired by methods in physics, to study the role of the SGD learning rate and batch size in generalization. We characterize test set performance under a shift between the training and test data distributions for loss functions with multiple minima. The shift can simply be due to sampling, and is therefore typically present in practical applications. We show that the resulting shift in local minima worsens test performance by picking up curvature, implying that generalization improves by selecting wide and/or little-shifted local minima. We then specialize to SGD and study its test performance under stationarity. Because obtaining the exact stationary distribution of SGD is intractable, we derive a Fokker-Planck approximation of SGD and obtain its stationary distribution instead. This process shows that the learning rate divided by the minibatch size plays a role analogous to temperature in statistical mechanics, and implies that SGD, including its stationary distribution, is largely invariant to changes in learning rate or batch size that leave its temperature constant. We show that increasing SGD temperature encourages the selection of local minima with lower curvature, and can enable better generalization. We provide experiments on CIFAR10 that demonstrate the temperature invariance of SGD, show improvement of the test loss as SGD temperature increases, and quantify the impact of sampling versus domain shift in driving this effect. Finally, we present synthetic experiments showing how our theory applies in a simplified loss with two local minima.
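To make the temperature argument concrete, here is a minimal sketch. It is not the paper's experiment or derivation. It assumes the gradient noise is isotropic and constant, so that the stationary density of the Fokker-Planck approximation takes the Gibbs form p(x) ∝ exp(-L(x)/T), with T taken to be the learning rate divided by the batch size and the noise scale absorbed into the units. The one-dimensional loss, the curvatures `k_wide` and `k_sharp`, the `depth_gap`, and both function names are hypothetical choices made for illustration only.

```python
import numpy as np


def sgd_temperature(learning_rate, batch_size):
    """Temperature proxy: learning rate divided by minibatch size."""
    return learning_rate / batch_size


def wide_basin_mass(temperature, k_wide=1.0, k_sharp=25.0, depth_gap=0.01):
    """Stationary probability mass in the wide basin of a toy two-minima loss.

    Assumes a Gibbs stationary density exp(-L(x) / T), which holds for
    Langevin dynamics with isotropic, constant noise (an idealization of SGD).
    """
    x = np.linspace(-8.0, 8.0, 200001)  # uniform grid, so sums act as integrals

    # Wide minimum at x = -1, slightly higher than the sharp minimum at x = +1.
    wide = 0.5 * k_wide * (x + 1.0) ** 2 + depth_gap
    sharp = 0.5 * k_sharp * (x - 1.0) ** 2
    loss = np.minimum(wide, sharp)

    # The barrier is the highest point of the loss between the two minima.
    between = (x > -1.0) & (x < 1.0)
    barrier = x[between][np.argmax(loss[between])]

    # Unnormalized Gibbs weights; subtracting the minimum avoids overflow.
    weights = np.exp(-(loss - loss.min()) / temperature)
    in_wide_basin = x < barrier
    return weights[in_wide_basin].sum() / weights.sum()


if __name__ == "__main__":
    for lr, batch in [(0.1, 128), (0.1, 32), (0.2, 64), (0.1, 8)]:
        t = sgd_temperature(lr, batch)
        print(f"lr={lr}, batch={batch}, T={t:.6f}, "
              f"wide-basin mass={wide_basin_mass(t):.4f}")
```

In this toy setting the sharp minimum is slightly deeper, so it dominates at low temperature, while the wide minimum gains mass as the temperature rises because its larger width outweighs the small depth penalty. The pairs (0.1, 32) and (0.2, 64) share the same ratio and therefore give the same mass, which mirrors the temperature invariance described above.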