Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to ease optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives a mathematical analysis and supporting experiments suggesting that normalization (together with the accompanying weight decay) encourages gradient descent (GD) to reduce the sharpness of the loss surface. Here "sharpness" is carefully defined, since the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and it characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.
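To make these quantities concrete, the sketch below (illustrative only, not code from the paper; the toy model, the names `toy_loss` and `sharpness`, and all hyperparameters are assumptions) builds a scale-invariant loss by normalizing the weights of a linear model, runs GD with weight decay at a finite learning rate eta, and estimates sharpness as the top Hessian eigenvalue via power iteration on finite-difference Hessian-vector products, printing it alongside the classical EoS threshold 2/eta.

```python
# Minimal sketch of two quantities from the abstract, on a toy problem:
# (1) normalization makes the loss scale-invariant in the weights, and
# (2) "sharpness" (top Hessian eigenvalue) can be tracked along a GD-with-
# weight-decay trajectory and compared to the EoS threshold 2/eta.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))   # toy inputs
y = rng.normal(size=64)        # toy targets
eps = 1e-8

def toy_loss(w):
    """Squared loss of a linear model whose weights are normalized before
    use, so toy_loss(c * w) == toy_loss(w) for any c > 0: the loss is
    scale-invariant, the known consequence of normalization noted above."""
    w_hat = w / (np.linalg.norm(w) + eps)
    return 0.5 * np.mean((X @ w_hat - y) ** 2)

def grad(w, h=1e-5):
    """Central-difference gradient (accurate enough for a toy demo)."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (toy_loss(w + e) - toy_loss(w - e)) / (2 * h)
    return g

def sharpness(w, iters=50, h=1e-4):
    """Top Hessian eigenvalue via power iteration on finite-difference
    Hessian-vector products: H v ~ (grad(w + h v) - grad(w - h v)) / (2 h)."""
    v = rng.normal(size=w.size)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad(w + h * v) - grad(w - h * v)) / (2 * h)
        lam = float(v @ hv)                 # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + eps)
    return lam

# Scale-invariance check: rescaling the weights leaves the loss unchanged.
w0 = rng.normal(size=8)
assert np.isclose(toy_loss(w0), toy_loss(3.0 * w0))

eta = 0.1    # finite learning rate
wd = 1e-2    # weight decay, as paired with normalization in the abstract
w = w0.copy()
for t in range(2000):
    w -= eta * (grad(w) + wd * w)          # GD step with weight decay
    if t % 500 == 0:
        print(f"step {t:4d}  loss {toy_loss(w):.4f}  "
              f"sharpness {sharpness(w):.3f}  2/eta = {2/eta:.3f}")
```

Whether this toy problem actually settles into the EoS regime depends on the step size and data; the sketch is only meant to show how scale-invariance can be verified and how sharpness is measured along a GD trajectory, not to reproduce the paper's analysis.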