To fully uncover the great potential of deep neural networks (DNNs), various learning algorithms have been developed to improve the model's generalization ability. Recently, sharpness-aware minimization (SAM) has established a generic scheme for improving generalization by minimizing a sharpness measure within a small neighborhood, and it achieves state-of-the-art performance. However, SAM requires two consecutive gradient evaluations to solve its min-max problem, which inevitably doubles the training time. In this paper, we resort to filter-wise random weight perturbations (RWP) to decouple the nested gradients in SAM. Unlike the small adversarial perturbations in SAM, RWP is softer and allows perturbations of much larger magnitude. Specifically, we jointly optimize the loss function under random perturbations and the original loss function: the former guides the network towards a wider flat region, while the latter helps recover the necessary local information. Because the two loss terms are complementary and mutually independent, their gradients can be computed efficiently in parallel, enabling nearly the same training speed as regular training. As a result, we achieve very competitive performance on CIFAR and remarkably better performance on ImageNet (e.g., $\mathbf{+1.1\%}$) compared with SAM, while requiring only half of its training time. The code is released at https://github.com/nblt/RWP.
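To make the joint objective concrete, the following is a minimal PyTorch sketch of one training step. It is an illustration only, not the released implementation (see the repository linked above); the noise scale `sigma`, the mixing coefficient `lambda_`, and the exact filter-wise scaling rule used here are assumptions of this sketch.

```python
import torch

def rwp_step(model, criterion, inputs, targets, optimizer, sigma=0.01, lambda_=0.5):
    """One RWP-style training step (illustrative sketch).

    The two gradient evaluations below are written sequentially for clarity;
    they are mutually independent and can run in parallel on two workers,
    which is where the speed advantage over SAM comes from.
    """
    # --- gradient of the original (unperturbed) loss ---
    optimizer.zero_grad()
    criterion(model(inputs), targets).backward()
    grads_orig = [p.grad.detach().clone() if p.grad is not None else None
                  for p in model.parameters()]

    # --- filter-wise random weight perturbation (assumed scaling rule) ---
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:  # conv / linear weights: scale noise per output filter
                filter_norms = p.reshape(p.size(0), -1).norm(dim=1)
                filter_norms = filter_norms.reshape(-1, *([1] * (p.dim() - 1)))
                eps = torch.randn_like(p) * sigma * filter_norms
            else:            # biases, normalization parameters: plain Gaussian noise
                eps = torch.randn_like(p) * sigma
            p.add_(eps)
            perturbations.append(eps)

    # --- gradient of the loss under random perturbation ---
    optimizer.zero_grad()
    criterion(model(inputs), targets).backward()

    # --- restore the weights and mix the two gradients ---
    with torch.no_grad():
        for p, eps, g_orig in zip(model.parameters(), perturbations, grads_orig):
            p.sub_(eps)
            if p.grad is not None and g_orig is not None:
                p.grad.mul_(lambda_).add_(g_orig, alpha=1.0 - lambda_)

    optimizer.step()
```

Because the two backward passes do not depend on each other, they can be dispatched to separate devices, which is what keeps the per-step cost close to that of regular training.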