Minimax-optimal convergence rates for broad classes of stochastic convex optimization problems are well characterized, and the majority of these results rely on iterate-averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, SGD's final iterate has received much less attention despite its widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem, with and without strong convexity? First, this work shows that even if the time horizon $T$ (i.e., the number of iterations SGD is run for) is known in advance, the final iterate of SGD with any polynomially decaying learning rate scheme is highly sub-optimal compared to the minimax rate (by a condition number factor in the strongly convex case and a factor of $\sqrt{T}$ in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically), offer significant improvements over any polynomially decaying step sizes. In particular, the final iterate under a Step Decay schedule is off the minimax rate by only $\log$ factors (in the condition number for the strongly convex case, and in $T$ for the non-strongly convex case). Finally, in stark contrast to the known-horizon case, this paper shows that the anytime (i.e., limiting) behavior of SGD's final iterate is poor, in that it queries iterates with highly sub-optimal function value infinitely often (i.e., in a limsup sense), irrespective of the step sizes employed. These results demonstrate the subtlety in establishing optimal learning rate schemes for the final iterate of stochastic gradient procedures in fixed time horizon settings.
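To make the two schedule families discussed above concrete, the following is a minimal Python sketch (not the paper's code) contrasting a polynomial $1/t$ decay with a Step Decay schedule for the final iterate of SGD on a synthetic streaming least squares instance. The halving factor of 2, the choice of roughly $\log_2 T$ equal-length phases, the data model, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def synthetic_stream(w_star, noise_std, rng):
    """Yield (x, y) samples from a simple Gaussian linear model (illustrative)."""
    d = w_star.shape[0]
    while True:
        x = rng.standard_normal(d)
        y = x @ w_star + noise_std * rng.standard_normal()
        yield x, y

def step_decay_lr(t, T, eta0):
    """Step Decay: split the known horizon T into ~log2(T) equal phases
    and cut the learning rate by a constant factor (here 2) each phase."""
    num_phases = max(1, int(np.ceil(np.log2(T))))
    phase_len = max(1, T // num_phases)
    phase = min((t - 1) // phase_len, num_phases - 1)
    return eta0 / (2 ** phase)

def poly_decay_lr(t, eta0, alpha=1.0):
    """Polynomial decay: eta_t = eta0 / t^alpha (e.g., alpha = 1)."""
    return eta0 / t ** alpha

def sgd_final_iterate(stream, w0, T, lr_fn):
    """Run T steps of single-sample SGD on the squared loss and return
    the final iterate (no averaging)."""
    w = w0.copy()
    for t in range(1, T + 1):
        x, y = next(stream)
        grad = (x @ w - y) * x  # gradient of 0.5 * (x @ w - y)**2
        w = w - lr_fn(t) * grad
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, T, eta0 = 10, 10_000, 0.1
    w_star = rng.standard_normal(d)
    schedules = [("poly 1/t", lambda t: poly_decay_lr(t, eta0)),
                 ("step decay", lambda t: step_decay_lr(t, T, eta0))]
    for name, lr_fn in schedules:
        w = sgd_final_iterate(synthetic_stream(w_star, 0.1, rng),
                              np.zeros(d), T, lr_fn)
        print(f"{name:10s} final-iterate error: {np.linalg.norm(w - w_star):.4f}")
```

Note that the Step Decay schedule requires $T$ in advance to place its phase boundaries, which is consistent with the abstract's distinction between the known-horizon setting and the anytime setting, where no schedule makes the final iterate well behaved.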