Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.
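For readers unfamiliar with the two methods, a brief sketch of their standard formulations (as given in the original SWA and SAM papers, not specific to this work): SWA averages the weights of checkpoints collected along the training trajectory, while SAM minimizes the loss under a worst-case perturbation of the weights,

\[
w_{\mathrm{SWA}} \;=\; \frac{1}{K}\sum_{k=1}^{K} w_k,
\qquad\qquad
\min_{w}\;\max_{\|\epsilon\|_2 \le \rho}\; L(w + \epsilon),
\]

where the $w_k$ are iterates saved at regular intervals under a constant or cyclical learning rate, and SAM approximates the inner maximization with a single ascent step $\hat{\epsilon} = \rho\,\nabla L(w)/\|\nabla L(w)\|_2$ before taking the usual gradient step at $w + \hat{\epsilon}$.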