Adjusting the learning rate schedule in stochastic gradient methods is an important unresolved problem which requires tuning in practice. If certain parameters of the loss function, such as smoothness or strong convexity constants, are known, theoretical learning rate schedules can be applied. In practice, however, such parameters are not known, and in any case the loss function of interest is not convex. The recently proposed batch normalization reparametrization is widely adopted in most neural network architectures today because, among other advantages, it is robust to the Lipschitz constant of the gradient of the loss function, allowing one to set a large learning rate without worry. Inspired by batch normalization, we propose a general nonlinear update rule for the learning rate in batch and stochastic gradient descent, so that the learning rate can be initialized at a high value and is subsequently decreased according to gradient observations along the way. The proposed method is shown to achieve robustness to the relationship between the learning rate and the Lipschitz constant, as well as near-optimal convergence rates in both the batch and stochastic settings ($O(1/T)$ for smooth loss in the batch setting, and $O(1/\sqrt{T})$ for convex loss in the stochastic setting). We also show through numerical evidence that this robustness extends to the highly nonconvex and possibly non-smooth loss functions arising in deep learning problems. Our analysis establishes a first theoretical understanding of the observed robustness of batch normalization and weight normalization.
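For concreteness, one plausible instantiation of such a nonlinear, gradient-driven update rule (a sketch under our own assumptions; the abstract does not fix the exact form) tracks accumulated gradient information in a scalar $b_j$, initialized at a small value $b_0 > 0$ so that the initial learning rate $1/b_0$ is large, and then sets
$$
b_{j+1} \;=\; b_j + \frac{\|\nabla f(x_j)\|^2}{b_j}, \qquad x_{j+1} \;=\; x_j - \frac{1}{b_{j+1}}\,\nabla f(x_j),
$$
so that the learning rate $1/b_{j+1}$ decreases only as quickly as the observed gradient magnitudes dictate, rather than according to a prespecified schedule.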