We study SAM (Sharpness-Aware Minimization), an optimizer that has recently attracted considerable interest due to its improved performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and its unnormalized variant USAM, in both the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, with approximation error scaling linearly with the step size). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones: it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that, perhaps unexpectedly, SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
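For reference, the discrete-time update that SAM performs (and the USAM variant obtained by dropping the gradient normalization) can be sketched as follows; this is a minimal illustrative implementation, and all names (`sam_step`, `grad_fn`, `lr`, `rho`) are assumptions, not the paper's code.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05, normalize=True):
    """One SAM update on parameters w (USAM if normalize=False).

    grad_fn(w) returns the (mini-batch) gradient of the loss at w.
    SAM first perturbs w in the ascent direction, then applies the
    gradient evaluated at the perturbed point.
    """
    g = grad_fn(w)
    if normalize:
        # SAM: perturbation of fixed norm rho (normalized ascent direction)
        eps = rho * g / (np.linalg.norm(g) + 1e-12)
    else:
        # USAM: unnormalized perturbation rho * g
        eps = rho * g
    return w - lr * grad_fn(w + eps)

# Toy usage: gradient of the quadratic loss L(w) = 0.5 * ||w||^2.
grad = lambda w: w
w = np.array([2.0, -1.0])
for _ in range(50):
    w = sam_step(w, grad, lr=0.1, rho=0.05)
print(w)  # iterates approach the minimum at the origin
```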