Sharpness-Aware Minimization (SAM) is a recent training method that relies on worst-case weight perturbations and significantly improves generalization in various settings. We argue that the existing justifications for the success of SAM, which are based on a PAC-Bayes generalization bound and the idea of convergence to flat minima, are incomplete. Moreover, there is no explanation for the success of using $m$-sharpness in SAM, which has been shown to be essential for generalization. To better understand this aspect of SAM, we theoretically analyze its implicit bias for diagonal linear networks. We prove that SAM always chooses a solution that enjoys better generalization properties than standard gradient descent for a certain class of problems, and that this effect is amplified by using $m$-sharpness. We further study the properties of the implicit bias on non-linear networks empirically, where we show that fine-tuning a standard model with SAM can lead to significant generalization improvements. Finally, we provide convergence results of SAM for non-convex objectives when used with stochastic gradients. We illustrate these results empirically for deep networks and discuss their relation to the generalization behavior of SAM. The code of our experiments is available at https://github.com/tml-epfl/understanding-sam.
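For concreteness, below is a minimal sketch of a SAM update with $m$-sharpness, assuming only a generic gradient oracle `loss_grad(w, x, y)`; the names `sam_step`, `rho`, `lr`, and `m` are illustrative and are not taken from the paper's code. With a micro-batch size $m$, the worst-case perturbation is computed separately on each micro-batch and the resulting SAM gradients are averaged.

```python
import numpy as np

def sam_step(w, batch_x, batch_y, loss_grad, lr=0.1, rho=0.05, m=None):
    """One SAM update with optional m-sharpness (illustrative sketch).

    For each micro-batch of size m, perturb the weights in the direction of
    that micro-batch's gradient, scaled to norm rho, then accumulate the
    gradient evaluated at the perturbed weights. The averaged gradient is
    used for the descent step.
    """
    n = len(batch_x)
    m = n if m is None else m  # m = n recovers a single perturbation per batch
    g_sam = np.zeros_like(w)
    for start in range(0, n, m):
        xb, yb = batch_x[start:start + m], batch_y[start:start + m]
        g = loss_grad(w, xb, yb)                        # ascent direction on this micro-batch
        eps = rho * g / (np.linalg.norm(g) + 1e-12)     # worst-case perturbation of norm rho
        g_sam += loss_grad(w + eps, xb, yb) * len(xb)   # gradient at the perturbed weights
    return w - lr * g_sam / n                           # descent step with the averaged SAM gradient

# Usage example on a least-squares loss with synthetic data (illustrative only).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 5)), rng.normal(size=64)
loss_grad = lambda w, xb, yb: 2 * xb.T @ (xb @ w - yb) / len(xb)
w = np.zeros(5)
for _ in range(100):
    w = sam_step(w, X, y, loss_grad, lr=0.05, rho=0.05, m=8)
```

Setting `m` to the full batch size corresponds to a single perturbation per batch, while smaller `m` computes perturbations on smaller micro-batches, which is the $m$-sharpness variant whose effect the abstract refers to.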