Deep neural networks often suffer from poor generalization caused by complex and non-convex loss landscapes. One of the popular solutions is Sharpness-Aware Minimization (SAM), which smooths the loss landscape by minimizing the maximum change in training loss under a perturbation added to the weights. However, we find that SAM's indiscriminate perturbation of all parameters is suboptimal and also results in excessive computation, i.e., double the overhead of common optimizers like Stochastic Gradient Descent (SGD). In this paper, we propose an efficient and effective training scheme coined Sparse SAM (SSAM), which achieves sparse perturbation via a binary mask. To obtain the sparse mask, we provide two solutions based on Fisher information and dynamic sparse training, respectively. In addition, we theoretically prove that SSAM converges at the same rate as SAM, i.e., $O(\log T/\sqrt{T})$. Sparse SAM not only has the potential for training acceleration but also smooths the loss landscape effectively. Extensive experimental results on CIFAR10, CIFAR100, and ImageNet-1K confirm the superior efficiency of our method over SAM, while performance is preserved or even improved with a perturbation sparsity of merely 50%. Code is available at https://github.com/Mi-Peng/Sparse-Sharpness-Aware-Minimization.
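To make the described procedure concrete, below is a minimal PyTorch-style sketch of a single sparse-SAM training step, written under stated assumptions rather than as the authors' implementation: a per-parameter binary mask (`masks`) is assumed to be precomputed, e.g., from Fisher information or dynamic sparse training as the abstract describes, and `rho` denotes the perturbation radius. The sketch only illustrates how the ascent perturbation is masked before the second forward/backward pass; see the official repository linked above for the actual method.

```python
import torch

def ssam_step(model, loss_fn, data, target, base_optimizer, masks, rho=0.05):
    """One illustrative sparse-SAM step (hypothetical sketch, not the official code)."""
    # First forward/backward pass: gradients at the current weights.
    loss = loss_fn(model(data), target)
    loss.backward()

    # Global gradient norm over all parameters with gradients.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)

    # Masked ascent perturbation e = rho * g / ||g||, zeroed where the binary mask is 0.
    perturbations = []
    with torch.no_grad():
        for p, m in zip(model.parameters(), masks):
            if p.grad is None:
                perturbations.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12) * m  # sparse perturbation
            p.add_(e)  # move to the (approximate) worst-case point
            perturbations.append(e)

    # Second forward/backward pass at the perturbed weights.
    model.zero_grad()
    loss_fn(model(data), target).backward()

    # Undo the perturbation, then update with the base optimizer (e.g., SGD).
    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    model.zero_grad()
    return loss.item()
```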