Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness. The optimization objective driving weight decay is a sum of losses plus a term proportional to the sum of squared weights. This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for this objective. For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of a different objective in which the regularization term is instead a sum of products of $\ell_2$ (not squared) norms of the input and output weights associated with each ReLU neuron. This alternative (and effectively equivalent) regularization suggests a novel proximal gradient algorithm for network training. Theory and experiments support the new training approach, showing that it can converge much faster to the sparse solutions it shares with standard weight decay training.
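To make the claimed equivalence concrete, the standard per-unit rescaling argument goes as follows (a sketch, using the fact that the ReLU is positively homogeneous, so rescaling a unit's input weights $u$ by $\alpha > 0$ and its output weights $w$ by $1/\alpha$ leaves the network function unchanged):
\[
\min_{\alpha > 0} \frac{\lambda}{2}\left(\|\alpha u\|_2^2 + \|w/\alpha\|_2^2\right)
= \lambda \,\|u\|_2 \,\|w\|_2,
\]
with the minimum attained at $\alpha^2 = \|w\|_2 / \|u\|_2$ by the AM-GM inequality. Since any minimizer of the weight decay objective must already be optimal with respect to such rescalings, its regularization cost per unit equals the product of norms, which is why the two objectives share their solutions.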
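The abstract does not spell out the proximal update itself, so the following is only a minimal sketch of the general structure of such a method: a gradient step on the data loss followed by the proximal operator of the nonsmooth penalty. For illustration it assumes a group-lasso penalty $\lambda \sum_i \|W_{i,:}\|_2$, whose prox is block soft-thresholding; the paper's penalty couples each unit's input and output weights, so its actual prox would differ, but the overall step structure is the same.

```python
import numpy as np

def block_soft_threshold(W, tau):
    """Row-wise block soft-thresholding: the proximal operator of
    tau * sum_i ||W[i, :]||_2. Each row's norm is shrunk by tau, and
    rows whose norm falls below tau are set exactly to zero, which is
    what produces group (neuron-level) sparsity."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale

def proximal_sgd_step(W, grad_loss, lr, lam):
    """One proximal (stochastic) gradient step on
        loss(W) + lam * sum_i ||W[i, :]||_2.
    Illustrative stand-in only: the paper's regularizer is a product of
    input/output weight norms per ReLU, so its prox operator differs."""
    W = W - lr * grad_loss                      # gradient step on the smooth data loss
    return block_soft_threshold(W, lr * lam)    # prox step on the nonsmooth penalty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 4))
    grad = rng.normal(size=(8, 4))              # placeholder gradient of the data loss
    W_new = proximal_sgd_step(W, grad, lr=0.1, lam=2.0)
    print("rows zeroed:", int((np.linalg.norm(W_new, axis=1) == 0).sum()))
```

Unlike plain SGD with weight decay, which only shrinks weights multiplicatively and never sets them exactly to zero in finite time, the prox step can zero out entire units in a single update, which is consistent with the faster convergence to sparse solutions reported in the abstract.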