We study the SAM (Sharpness-Aware Minimization) optimizer, which has recently attracted considerable interest due to its improved performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and two of its variants, for both the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, with error scaling linearly in the step size). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones: it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that, perhaps unexpectedly, SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
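For reference, the discrete-time SAM update underlying this analysis is the standard one (a sketch with step size $\eta$ and perturbation radius $\rho$; the notation in the body of the paper may differ):
\begin{equation*}
    w_{k+1} \;=\; w_k \;-\; \eta\, \nabla L\!\left(w_k + \rho\,\frac{\nabla L(w_k)}{\lVert \nabla L(w_k)\rVert}\right).
\end{equation*}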