Weight decay is a popular regularization technique for training deep neural networks. Modern deep learning libraries mainly use $L_{2}$ regularization as the default implementation of weight decay. \citet{loshchilov2018decoupled} demonstrated that $L_{2}$ regularization is not identical to weight decay for adaptive gradient methods, such as Adaptive Moment Estimation (Adam), and proposed Adam with Decoupled Weight Decay (AdamW). However, we find that the popular implementations of weight decay in modern deep learning libraries, including $L_{2}$ regularization and decoupled weight decay, often damage performance. First, $L_{2}$ regularization acts as an unstable form of weight decay for all optimizers that use momentum, such as stochastic gradient descent (SGD) with momentum. Second, decoupled weight decay is highly unstable for all adaptive gradient methods. We further propose the Stable Weight Decay (SWD) method to fix the unstable weight decay problem from a dynamical perspective. The proposed SWD method yields significant improvements over $L_{2}$ regularization and decoupled weight decay in our experiments. Simply fixing weight decay in Adam via SWD, with no extra hyperparameter, usually outperforms complex Adam variants that have more hyperparameters.
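As a minimal illustration of the distinction discussed above (a simplified sketch in the standard Adam notation of \citet{loshchilov2018decoupled}, omitting learning rate schedules: $\hat{m}_t$ and $\hat{v}_t$ denote the bias-corrected first and second moment estimates computed from the gradient $g_t$, $\eta$ the learning rate, $\lambda$ the weight decay coefficient, and $\epsilon$ a small constant), $L_{2}$ regularization injects the decay term into the gradient, so it is rescaled by the adaptive preconditioner, whereas decoupled weight decay subtracts it from the parameters directly:
\begin{align*}
\text{$L_{2}$ regularization:} \quad & g_t = \nabla f_t(\theta_{t-1}) + \lambda \theta_{t-1}, & \theta_t &= \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \\
\text{Decoupled weight decay:} \quad & g_t = \nabla f_t(\theta_{t-1}), & \theta_t &= \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda \theta_{t-1}.
\end{align*}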