In this work, we propose an optimization algorithm which we call norm-adapted gradient descent. Like other gradient-based optimization algorithms such as Adam or Adagrad, it adapts the learning rate of stochastic gradient descent at each iteration. However, rather than using statistical properties of observed gradients, norm-adapted gradient descent relies on a first-order estimate of the effect of a standard gradient descent update step, much like a multidimensional Newton-Raphson method. Our algorithm can also be compared to quasi-Newton methods, but it seeks roots of the loss rather than stationary points. Seeking roots is justified by the fact that, for models of sufficient capacity trained under nonnegative loss functions, roots of the loss coincide with global optima. We present several experiments using our algorithm; in these results, norm-adapted gradient descent appears particularly strong in regression settings but is also capable of training classifiers.
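The precise update rule is not spelled out here, so the following is only a minimal sketch of the root-seeking idea under one natural reading of the description: the first-order model of a gradient step, loss(p - eta*g) ≈ loss(p) - eta*||g||^2, is solved for the eta that drives the (nonnegative) loss to zero, giving eta = loss(p) / ||g||^2. The function name `norm_adapted_step` and the least-squares example are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def norm_adapted_step(params, loss_fn, grad_fn, eps=1e-12):
    """One hypothetical norm-adapted update (sketch, not the paper's exact rule).

    First-order model: loss(p - eta * g) ~= loss(p) - eta * ||g||^2.
    Choosing eta so this model reaches zero (a root of the nonnegative loss)
    gives eta = loss(p) / ||g||^2, i.e. a Newton-Raphson-style root-seeking
    step taken along the gradient direction.
    """
    loss = loss_fn(params)
    g = grad_fn(params)
    eta = loss / (np.dot(g, g) + eps)  # norm-adapted learning rate
    return params - eta * g

# Toy least-squares regression: the loss is nonnegative, so a root is a global optimum.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true

loss_fn = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
grad_fn = lambda w: X.T @ (X @ w - y) / len(y)

w = np.zeros(3)
for _ in range(200):
    w = norm_adapted_step(w, loss_fn, grad_fn)
print(loss_fn(w))  # should be close to zero
```

In this reading, the step size grows when the loss is large relative to the gradient norm and shrinks as a root is approached, which matches the abstract's framing of root seeking rather than stationary-point seeking.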