How to train deep neural networks (DNNs) that generalize well is a central concern in deep learning, especially for today's severely overparameterized networks. In this paper, we propose an effective method to improve model generalization by additionally penalizing the gradient norm of the loss function during optimization. We demonstrate that constraining the gradient norm of the loss function helps guide optimizers toward flat minima. We leverage a first-order approximation to efficiently compute the corresponding gradient so that it fits naturally into the gradient descent framework. Our experiments confirm that the proposed method improves the generalization performance of various models on different datasets. We also show that the recent sharpness-aware minimization method \cite{DBLP:conf/iclr/ForetKMN21} is a special, but not optimal, case of our method, and that the optimal case achieves new state-of-the-art performance on these tasks.
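Below is a minimal PyTorch sketch of how such a gradient-norm penalty can be folded into an ordinary training step using a finite-difference (first-order) approximation, so that no explicit second-order derivatives are needed. The function name gradient_norm_penalized_step, the hyperparameters lam and r, and the specific way the two gradients are combined are illustrative assumptions, not the exact formulation from the paper.

\begin{verbatim}
import torch


def gradient_norm_penalized_step(model, loss_fn, x, y, optimizer,
                                 lam=0.01, r=0.05):
    """One step on the penalized objective L(theta) + lam * ||grad L(theta)||.

    The gradient of the penalty term is approximated with a finite difference,
    i.e. two ordinary forward/backward passes instead of second-order derivatives.
    """
    # Pass 1: gradient of the plain loss at the current weights theta.
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()

    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item() + 1e-12

    # Perturb the weights by r along the normalized gradient direction.
    scale = r / grad_norm
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=scale)

    # Pass 2: gradient of the loss at the perturbed weights theta + r * g/||g||.
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # Restore the original weights and combine the two gradients:
    #   (1 - lam/r) * grad L(theta) + (lam/r) * grad L(theta + r * g/||g||).
    alpha = lam / r
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(g, alpha=scale)
            p.grad.mul_(alpha).add_(g, alpha=1.0 - alpha)

    optimizer.step()
    return loss.item()
\end{verbatim}

In this sketch the ratio lam / r controls the interpolation between the gradient at the current weights and the gradient at the perturbed point; setting lam equal to r (alpha = 1) recovers a SAM-style update, consistent with the claim that sharpness-aware minimization is a special case of the penalized objective.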