Conventional wisdom dictates that the learning rate should lie in the stable regime so that gradient-based algorithms do not blow up. This letter introduces a simple scenario where an unstably large learning rate scheme leads to super-fast convergence, with a convergence rate that depends only logarithmically on the condition number of the problem. Our scheme uses a Cyclical Learning Rate (CLR): in each cycle we take one large unstable step and several small stable steps to compensate for the instability. These findings also help explain the empirical observations of [Smith and Topin, 2019], where CLR with a large maximum learning rate dramatically accelerates learning and leads to so-called "super-convergence". We prove that our scheme excels on problems where the Hessian exhibits a bimodal spectrum, i.e., its eigenvalues can be grouped into two clusters (small and large). The unstably large step is the key to enabling fast convergence over the small-eigenvalue cluster.
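The following is a minimal numerical sketch of the idea, not the paper's exact construction: gradient descent on a diagonal quadratic whose Hessian spectrum is bimodal, with one large unstable step followed by roughly log(condition number) small stable steps per cycle. All eigenvalue clusters and step-size constants below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (assumed constants): f(x) = 0.5 * x^T diag(H) x
# with a bimodal Hessian spectrum.
small_eigs = np.linspace(1.0, 1.5, 5)     # small cluster (around mu ~ 1)
large_eigs = np.linspace(80.0, 100.0, 5)  # large cluster (around L ~ 100)
H = np.concatenate([small_eigs, large_eigs])
kappa = H.max() / H.min()                 # condition number ~ 100

eta_small = 1.0 / H.max()                                # stable step, safe for all directions
eta_big = 2.0 / (small_eigs.min() + small_eigs.max())    # unstable step, tuned to the small cluster
k_small = int(np.ceil(np.log(kappa)))                    # ~log(kappa) stable steps per cycle

x = np.ones_like(H)  # start away from the optimum x* = 0
for cycle in range(10):
    # One large, unstable step: contracts the small-eigenvalue directions,
    # but inflates the large-eigenvalue directions by a factor of order kappa.
    x = x - eta_big * (H * x)
    # Several small, stable steps: damp that blow-up again.
    for _ in range(k_small):
        x = x - eta_small * (H * x)
    print(f"cycle {cycle}: ||x - x*|| = {np.linalg.norm(x):.2e}")
```

With these assumed constants, each cycle contracts the error along the small-eigenvalue cluster by a constant factor, while the blow-up along the large-eigenvalue cluster incurred by the single unstable step is erased by the handful of stable steps, consistent with the logarithmic dependence on the condition number described above.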