Weight decay is a popular and even necessary regularization technique for training deep neural networks that generalize well. Previous work usually interpreted weight decay as a Gaussian prior from the Bayesian perspective. However, weight decay sometimes exhibits mysterious behaviors that this conventional understanding cannot explain. For example, the optimal weight decay value tends toward zero given long enough training time. Moreover, existing work has typically failed to recognize the importance of scheduling weight decay during training. Our work aims to theoretically understand these novel behaviors of weight decay and to design schedulers for weight decay in deep learning. This paper makes three main contributions. First, we propose a novel theoretical interpretation of weight decay from the perspective of learning dynamics. Second, we propose a novel weight-decay linear scaling rule for large-batch training that proportionally increases weight decay, rather than the learning rate, as the batch size increases. Third, we provide an effective learning-rate-aware scheduler for weight decay, called the Stable Weight Decay (SWD) method, which, to the best of our knowledge, is the first practical design for scheduling weight decay. In our various experiments, the SWD method often improves over $L_{2}$ Regularization and Decoupled Weight Decay.
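The following is a minimal sketch, not the authors' implementation, illustrating two of the ideas named above under assumed forms: a weight-decay linear scaling rule (scale weight decay, rather than the learning rate, in proportion to the batch size) and a learning-rate-aware, decoupled weight-decay step for plain SGD. The function names `scaled_weight_decay` and `sgd_step_with_stable_decay`, and all numeric values, are hypothetical placeholders; the paper's actual SWD update may differ in detail.

```python
# Sketch only: assumed forms of the linear scaling rule and a
# learning-rate-aware (decoupled) weight-decay step, not the official SWD code.

def scaled_weight_decay(base_wd: float, base_batch: int, batch: int) -> float:
    """Assumed linear scaling rule: weight decay grows proportionally
    with the batch size while the learning rate is kept fixed."""
    return base_wd * batch / base_batch

def sgd_step_with_stable_decay(params, grads, lr: float, wd: float):
    """One plain-SGD step in which weight decay is decoupled from the
    gradient and multiplied by the current learning rate, so the decay
    applied per step follows the learning-rate schedule."""
    return [p - lr * g - lr * wd * p for p, g in zip(params, grads)]

if __name__ == "__main__":
    # Hypothetical numbers purely for illustration.
    wd = scaled_weight_decay(base_wd=5e-4, base_batch=128, batch=1024)
    print(f"scaled weight decay: {wd:.1e}")  # 4.0e-03

    params, grads = [1.0, -2.0], [0.1, -0.3]
    print(sgd_step_with_stable_decay(params, grads, lr=0.1, wd=wd))
```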