By driving models to converge to flat minima, sharpness-aware learning algorithms (such as SAM) have achieved state-of-the-art performance. However, these algorithms generally incur one extra forward-backward propagation pair at each training iteration, which substantially increases the computational cost, especially for large-scale models. To this end, we propose a simple yet efficient training scheme, called Randomized Sharpness-Aware Training (RST). At each iteration, the RST optimizer performs a Bernoulli trial to choose randomly between the base algorithm (SGD) and the sharpness-aware algorithm (SAM), with the probability set by a predefined scheduling function. Because base-algorithm steps are mixed in, the total number of propagation pairs can be greatly reduced. We also provide a theoretical analysis of the convergence of RST. We then empirically study the computational cost and effect of various types of scheduling functions, and give guidance on choosing appropriate scheduling functions. Furthermore, we extend RST to a general framework (G-RST), in which the degree of regularization on sharpness can be adjusted freely for any scheduling function. We show that G-RST outperforms SAM in most cases while saving 50\% of the extra computation cost.
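To make the per-iteration choice concrete, the following is a minimal sketch of a single RST step under assumed PyTorch-style conventions; it is not the authors' implementation, and the names (`rst_step`, `schedule`, `rho`, `base_opt`) and the linear scheduling function are illustrative only.

```python
# Minimal sketch of one RST step: a Bernoulli trial decides whether this
# iteration takes a SAM step (two propagation pairs) or a plain SGD step (one).
import torch

def schedule(t, T, p_max=0.5):
    # Hypothetical linear scheduling function: probability of a SAM step
    # grows from 0 to p_max over training (the paper studies several shapes).
    return p_max * t / T

def rst_step(model, loss_fn, batch, base_opt, t, T, rho=0.05):
    inputs, targets = batch
    use_sam = torch.bernoulli(torch.tensor(schedule(t, T))).item() == 1

    # First forward-backward pass (needed by both branches).
    base_opt.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    if use_sam:
        # SAM branch: perturb weights toward the worst case within an L2 ball
        # of radius rho, recompute gradients there, then restore the weights.
        params = [p for p in model.parameters() if p.grad is not None]
        with torch.no_grad():
            grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
            eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
            for p, e in zip(params, eps):
                p.add_(e)
        base_opt.zero_grad()
        loss_fn(model(inputs), targets).backward()  # extra propagation pair
        with torch.no_grad():
            for p, e in zip(params, eps):
                p.sub_(e)  # restore original weights before the update

    base_opt.step()  # base step uses whichever gradients are currently stored
    return loss.item()
```

With a scheduling function whose average probability is 0.5, roughly half of the iterations skip the second propagation pair, which is the source of the reported savings in extra computation.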