The sharpness-aware minimization (SAM) optimizer has been extensively explored because it improves the generalization of deep neural networks by introducing an extra perturbation step that flattens the loss landscape. Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically for training large-scale deep neural networks, but without a theoretical guarantee, owing to the three intertwined difficulties of analyzing the coupled perturbation step, adaptive learning rate, and momentum step. In this paper, we analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM admits a $\mathcal{O}(1/\sqrt{bT})$ convergence rate, which achieves a linear speedup with respect to the mini-batch size $b$. Specifically, to decouple the stochastic gradient steps from the adaptive learning rate and the perturbed gradient, we introduce a delayed second-order momentum term that renders them independent when taking expectations in the analysis. We then bound these terms by showing that the adaptive learning rate lies within a limited range, which makes our analysis feasible. To the best of our knowledge, we are the first to provide a non-trivial convergence rate for SAM with an adaptive learning rate and momentum acceleration. Finally, we conduct experiments on several NLP tasks, which show that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
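To make the structure of the coupled update concrete, below is a minimal NumPy sketch of one AdaSAM-style iteration: a SAM ascent (perturbation) step followed by an Adam/AMSGrad-style step with momentum and an adaptive learning rate. The function name, arguments, and default values are illustrative assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

def adasam_step(w, grad_fn, m, v, lr=1e-3, rho=0.05,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative AdaSAM-style iteration (hypothetical sketch).

    w       : current parameter vector
    grad_fn : callable returning a stochastic gradient at a given point
    m, v    : first- and second-order momentum buffers
    """
    g = grad_fn(w)                                       # stochastic gradient at w
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)    # SAM perturbation (ascent) step
    g_adv = grad_fn(w_adv)                               # gradient at the perturbed point

    m = beta1 * m + (1 - beta1) * g_adv                  # momentum on the perturbed gradient
    v = beta2 * v + (1 - beta2) * g_adv ** 2             # second-order momentum (adaptive rate)
    w = w - lr * m / (np.sqrt(v) + eps)                  # adaptive update

    return w, m, v
```

The three quantities coupled in this update, namely the perturbed gradient `g_adv`, the momentum buffer `m`, and the adaptive denominator `sqrt(v)`, are exactly the terms whose interaction the analysis must disentangle, which motivates the delayed second-order momentum technique described above.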