Gradient descent is slow to converge for ill-conditioned and non-convex problems. An important acceleration technique is step-size adaptation. The first part of this paper is a detailed review of step-size adaptation methods, including the Polyak step-size, L4, LossGrad, Adam, IDBD, and hypergradient descent, and of the relation of step-size adaptation to meta-gradient methods. In the second part of this paper, we propose a new class of methods for accelerating gradient descent that is distinct from existing techniques. The new methods, which we call {\em step-size planning}, use the {\em update experience} to learn an improved way of updating the parameters. The methods organize the experience into pairs of iterates $K$ steps apart to facilitate planning. From this past experience, our planning algorithm, Csawg, learns a step-size model, a form of multi-step machine that predicts future updates. We extend Csawg to apply step-size planning multiple steps ahead, which leads to further speedup. We discuss and highlight the projection power of the diagonal-matrix step-size for future large-scale applications. We show that for a convex problem, our methods can surpass the convergence rate of Nesterov's accelerated gradient, $1 - \sqrt{\mu/L}$, where $\mu$ is the strong convexity constant of the loss function $F$ and $L$ is the Lipschitz constant of $F'$; this rate is the theoretical limit for the convergence rate of first-order methods. On the well-known non-convex Rosenbrock function, our planning methods reach zero error in fewer than 500 gradient evaluations, while gradient descent takes about 10,000 gradient evaluations to reach an accuracy of $10^{-3}$. We discuss the connection of step-size planning to planning in reinforcement learning, in particular to Dyna architectures. (This abstract is shorter than the one in the paper because of the length requirement.)
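To make the idea concrete, the sketch below illustrates step-size planning in the spirit described above: it collects gradient-descent update experience on the Rosenbrock function, pairs iterates $K$ steps apart, and fits a diagonal step-size model that maps a one-step update to the $K$-step displacement it eventually produced. This is our own minimal illustration, not the paper's Csawg algorithm; the function and variable names (rosenbrock_grad, the diagonal model D) and the constants alpha and K are illustrative assumptions.

\begin{verbatim}
import numpy as np

def rosenbrock_grad(x):
    """Gradient of the 2-D Rosenbrock function
    f(x) = (1 - x0)^2 + 100 * (x1 - x0^2)^2."""
    g0 = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2)
    g1 = 200 * (x[1] - x[0] ** 2)
    return np.array([g0, g1])

alpha, K = 1e-3, 5          # base step-size and planning horizon (illustrative)
x = np.array([-1.0, 1.0])   # a common Rosenbrock starting point

# Collect update experience with plain gradient descent.
snapshots = [x.copy()]
for _ in range(3 * K):
    x = x - alpha * rosenbrock_grad(x)
    snapshots.append(x.copy())

# Organize the experience into pairs of iterates K steps apart: the model
# should map a one-step update to the K-step displacement it produced.
one_step = np.array([snapshots[t + 1] - snapshots[t]
                     for t in range(len(snapshots) - K)])
k_step = np.array([snapshots[t + K] - snapshots[t]
                   for t in range(len(snapshots) - K)])

# Fit a diagonal step-size model D by per-coordinate least squares, so that
# D * one_step ~= k_step; applying D replays roughly K steps at once.
D = (one_step * k_step).sum(axis=0) / ((one_step ** 2).sum(axis=0) + 1e-12)

# Planning step: scale the ordinary update by the learned diagonal step-sizes.
x = x - D * alpha * rosenbrock_grad(x)
print("diagonal step-size model:", D, "new iterate:", x)
\end{verbatim}

A diagonal model, as opposed to a single scalar, learns one scale per coordinate, which is the diagonal-matrix step-size the abstract highlights for future large-scale applications.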