Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks in various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in its theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness was used for proving generalization guarantees. The subtle differences between these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximation in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect when full-batch gradients are applied. Furthermore, we prove that the stochastic version of SAM in fact regularizes the third notion of sharpness mentioned above, which is most likely the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the alignment between the gradient and the top eigenvector of the Hessian when SAM is applied.
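For readers unfamiliar with the algorithm, the "computationally efficient variant" mentioned above refers to SAM's standard two-step update: an ascent step to a perturbed point, followed by a descent step using the gradient evaluated there. The following is a minimal sketch of that update on a toy quadratic loss with one sharp and one flat direction; the loss, the Hessian `H`, and the hyperparameters `lr` and `rho` are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T H w (illustrative only).
H = np.diag([10.0, 1.0])  # one sharp direction (eigenvalue 10), one flat (eigenvalue 1)

def grad(w):
    return H @ w  # gradient of the quadratic loss

def sam_step(w, lr=0.05, rho=0.1):
    g = grad(w)
    # Ascent step: move distance rho along the normalized gradient.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: update w using the gradient at the perturbed point.
    return w - lr * grad(w + eps)

w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w)
```

Because the perturbation is normalized to a fixed radius `rho`, the iterates do not converge exactly to the minimizer but settle in a small neighborhood of it, with the sharp direction penalized more strongly than the flat one.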