Many popular learning-rate schedules for deep neural networks combine a decaying trend with local perturbations that attempt to escape saddle points and bad local minima. We derive convergence guarantees for bandwidth-based step sizes, a general class of learning rates that are allowed to vary within a banded region. This framework includes cyclic and non-monotonic step sizes for which no theoretical guarantees were previously known. We provide worst-case guarantees for SGD on smooth non-convex problems under several bandwidth-based step sizes, including stagewise $1/\sqrt{t}$ and the popular step-decay schedule (constant within a stage, then dropped by a constant factor), which is also shown to be optimal. Moreover, we show that the momentum variant of SGD (SGDM) converges as fast as SGD under the bandwidth-based step-decay step size. Finally, we propose some novel step-size schemes in the bandwidth-based family and verify their efficiency on several deep neural network training tasks.
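To make the schedules concrete, the following is a minimal Python sketch of two members of the bandwidth-based family described above: a step-decay schedule and a stagewise $1/\sqrt{t}$ schedule whose value is allowed to move inside a band around the decaying envelope. All constants (`eta0`, `drop`, `period`, the band factors `lo` and `hi`) are illustrative assumptions, not values taken from the paper.

```python
import math

def step_decay(t, eta0=0.1, drop=0.5, period=30):
    """Step-decay: constant within each stage of `period` steps,
    then dropped by the constant factor `drop`.
    (All constants here are assumed for illustration.)"""
    return eta0 * (drop ** (t // period))

def bandwidth_stagewise_sqrt(t, eta0=0.1, lo=0.5, hi=1.5, period=30):
    """Bandwidth-based stagewise 1/sqrt(t): the step size may take any value
    in the band [lo * env, hi * env], where env = eta0 / sqrt(stage) is the
    decaying envelope. Here we simply sweep across the band within each stage
    to mimic a cyclic, non-monotonic schedule."""
    stage = t // period + 1
    env = eta0 / math.sqrt(stage)
    frac = (t % period) / max(period - 1, 1)   # position within the current stage
    return (lo + (hi - lo) * frac) * env       # stays inside the banded region

if __name__ == "__main__":
    # Print a few steps to show the decaying trend with local variation.
    for t in (0, 29, 30, 59, 60):
        print(t, round(step_decay(t), 4), round(bandwidth_stagewise_sqrt(t), 4))
```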