In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid -- or navigate out of -- regions of high curvature and into flatter regions that tolerate a higher learning rate. Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization.
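The abstract highlights learning rate warmup as a mitigation for training instability. A minimal sketch of a linear warmup schedule is below; the `warmup_lr` helper and its parameters are illustrative assumptions, not the paper's exact schedule:

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear learning-rate warmup (illustrative sketch, not the paper's
    exact schedule): ramp from near zero up to base_lr over warmup_steps,
    then hold at base_lr. Ramping slowly lets early optimization settle
    into a flatter region before the full learning rate is applied."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Example: peak LR of 0.1 reached after 1000 warmup steps.
schedule = [warmup_lr(s, base_lr=0.1, warmup_steps=1000) for s in range(2000)]
```

In practice this schedule would typically be combined with a decay phase (e.g. cosine decay) after warmup completes.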