Learning rate decay (lrDecay) is a \emph{de facto} technique for training modern neural networks. It starts with a large learning rate and then decays it multiple times. It is empirically observed to help both optimization and generalization. Common beliefs about how lrDecay works come from the optimization analysis of (Stochastic) Gradient Descent: 1) an initially large learning rate accelerates training or helps the network escape spurious local minima; 2) decaying the learning rate helps the network converge to a local minimum and avoid oscillation. Despite the popularity of these common beliefs, experiments suggest that they are insufficient to explain the general effectiveness of lrDecay in training modern neural networks that are deep, wide, and nonconvex. We provide another novel explanation: an initially large learning rate suppresses the network from memorizing noisy data, while decaying the learning rate improves the learning of complex patterns. The proposed explanation is validated on a carefully constructed dataset with tractable pattern complexity. Its implication, that additional patterns learned in later stages of lrDecay are more complex and thus less transferable, is justified on real-world datasets. We believe that this alternative explanation will shed light on the design of better training strategies for modern neural networks.
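For concreteness, a minimal sketch of the step-wise lrDecay schedule described above is given below in Python; the base learning rate, decay factor, and milestone epochs are illustrative assumptions rather than settings reported in the paper.
\begin{verbatim}
# Minimal sketch of step-wise lrDecay: start with a large learning rate
# and multiply it by a decay factor at each milestone epoch.
# (Values below are illustrative assumptions, not the paper's settings.)
def lr_decay_schedule(epoch, base_lr=0.1, decay_factor=0.1,
                      milestones=(30, 60, 90)):
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay_factor
    return lr

# Example: lr = 0.1 before epoch 30, 0.01 before 60, 0.001 before 90.
for epoch in (0, 30, 60, 90):
    print(epoch, lr_decay_schedule(epoch))
\end{verbatim}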