We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence. In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step-size, on the spectral norm of the Hessian. In such cases, SAM's update may be regarded as a third derivative -- the derivative of the Hessian in the leading eigenvector direction -- that encourages drift toward wider minima.
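As a concrete illustration of the oscillation described above, here is a minimal numerical sketch (ours, not the paper's code) that applies the normalized SAM update of Foret et al. to a two-dimensional convex quadratic. The matrix H, the step size, the perturbation radius, and the initialization are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (illustrative, not the paper's code): the normalized SAM update
# of Foret et al. applied to a convex quadratic L(w) = 0.5 * w^T H w.
# H, eta, rho, and the initialization are hypothetical choices for illustration.
import numpy as np

H = np.diag([10.0, 1.0])   # largest curvature along the first coordinate
eta, rho = 0.05, 0.1       # assumed step size and SAM perturbation radius

def grad(w):
    """Gradient of the quadratic loss L(w) = 0.5 * w^T H w."""
    return H @ w

w = np.array([0.3, 0.3])
for t in range(200):
    g = grad(w)
    w_adv = w + rho * g / np.linalg.norm(g)  # ascent step to the perturbed point
    w = w - eta * grad(w_adv)                # gradient step evaluated at that point
    if t >= 196:
        print(t, w)
# Late iterates flip the sign of the first coordinate each step while the second
# coordinate decays, i.e. they settle into a cycle that bounces across the
# minimum along the direction of largest curvature.
```

Under these assumed values, the printed iterates alternate around roughly ±0.033 in the first coordinate, matching the kind of two-point cycle the abstract describes for the quadratic case.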