Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. Although some studies have reported that GR improves generalization performance in deep learning, little attention has been paid to GR from an algorithmic perspective, that is, to which algorithms realize GR efficiently and improve performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of GR. In addition, this computation empirically achieves better generalization performance. Next, we theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias in a certain problem. In particular, learning with finite-difference GR chooses better minima as the ascent step size becomes larger. Finally, we demonstrate that finite-difference GR is closely related to other algorithms based on iterative ascent and descent steps for exploring flat minima: sharpness-aware minimization and the flooding method. We reveal that flooding performs finite-difference GR implicitly. Thus, this work broadens our understanding of GR in both practice and theory.
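As a concrete illustration of the ascent-then-descent computation described above, the following is a minimal PyTorch sketch of one finite-difference GR update, assuming the GR objective L(θ) + (γ/2)||∇L(θ)||² and the standard forward-difference approximation of the Hessian-vector product. The function name `fd_gr_step`, the hyperparameter values, and the exact way the two gradients are mixed are illustrative assumptions, not the paper's prescribed implementation.

```python
import torch

def fd_gr_step(params, loss_fn, lr=0.1, gamma=0.01, eps=0.01):
    """One finite-difference GR update on `params` (a list of tensors with
    requires_grad=True). `loss_fn` is a closure that recomputes the training
    loss from the current parameter values."""
    # Gradient at the current point theta.
    grads = torch.autograd.grad(loss_fn(), params)

    # Gradient *ascent* perturbation: theta -> theta + eps * grad L(theta).
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(eps * g)

    # Gradient at the ascended point theta + eps * grad L(theta).
    grads_pert = torch.autograd.grad(loss_fn(), params)

    # Undo the perturbation, then take the *descent* step with the
    # finite-difference estimate
    #   grad[L + (gamma/2) * ||grad L||^2]
    #     ~ (1 - gamma/eps) * grad L(theta)
    #       + (gamma/eps) * grad L(theta + eps * grad L(theta)).
    with torch.no_grad():
        for p, g, gp in zip(params, grads, grads_pert):
            p.add_(-eps * g)  # back to theta
            p.add_(-lr * ((1 - gamma / eps) * g + (gamma / eps) * gp))
```

Note that when `gamma == eps`, the update reduces to descending along the gradient evaluated at the ascended point, i.e., a pure gradient-ascent-then-descent step, and `eps` plays the role of the ascent step size mentioned in the implicit-bias result.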