The learning rate is perhaps the single most important parameter in the training of neural networks and, more broadly, in stochastic (nonconvex) optimization. Accordingly, there are numerous effective, but poorly understood, techniques for tuning the learning rate, including learning rate decay, which starts with a large initial learning rate that is gradually decreased. In this paper, we present a general theoretical analysis of the effect of the learning rate in stochastic gradient descent (SGD). Our analysis is based on the use of a learning-rate-dependent stochastic differential equation (lr-dependent SDE) that serves as a surrogate for SGD. For a broad class of objective functions, we establish a linear rate of convergence for this continuous-time formulation of SGD, highlighting the fundamental importance of the learning rate in SGD, in contrast to gradient descent and stochastic gradient Langevin dynamics. Moreover, we obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Witten-Laplacian, a special case of the Schr\"odinger operator associated with the lr-dependent SDE. Strikingly, this expression clearly reveals the dependence of the linear convergence rate on the learning rate -- the linear rate decreases rapidly to zero as the learning rate tends to zero for a broad class of nonconvex functions, whereas it stays constant for strongly convex functions. Based on this sharp distinction between nonconvex and convex problems, we provide a mathematical interpretation of the benefits of using learning rate decay for nonconvex optimization.
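As a schematic illustration of how the learning rate enters the surrogate dynamics (a minimal sketch in generic notation, assuming isotropic gradient noise for simplicity; it is not a verbatim statement of the model analyzed in the paper), write one SGD step with learning rate $s$ as
\[
x_{k+1} \;=\; x_k - s\bigl(\nabla f(x_k) + \xi_k\bigr), \qquad \mathbb{E}[\xi_k] = 0,
\]
and identify continuous time $t \approx ks$. This yields an lr-dependent SDE of the form
\[
dX_t \;=\; -\nabla f(X_t)\,dt \;+\; \sqrt{s}\,dW_t ,
\]
in which the learning rate $s$ scales the diffusion term. The generator of such a diffusion is unitarily equivalent to a Schr\"odinger-type operator (a Witten-Laplacian in this setting), and its spectral gap governs the linear convergence rate: for strongly convex $f$ the gap remains bounded away from zero as $s \to 0$, whereas for multi-well nonconvex $f$ it exhibits an Eyring--Kramers-type collapse, shrinking exponentially in $1/s$; the precise constants depend on the barrier structure of $f$ and are not reproduced in this sketch.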