For training neural networks, flat-minima optimizers, which seek parameters in neighborhoods of uniformly low loss (flat minima), have been shown to improve upon stochastic and adaptive gradient-based methods. Two families of methods for finding flat minima stand out: 1. averaging methods (e.g., Stochastic Weight Averaging, SWA), and 2. minimax methods (e.g., Sharpness-Aware Minimization, SAM). However, despite similar motivations, there has been limited investigation into their properties and no comprehensive comparison between them. In this work, we investigate the loss surfaces that result from a systematic benchmark of these approaches across computer vision, natural language processing, and graph learning tasks. This leads us to a hypothesis: since the two approaches find flat solutions in orthogonal ways, combining them should improve generalization even further. We verify that the combination improves over either flat-minima approach alone in 39 out of 42 cases, and provide potential explanations for the cases where it does not. We hope our results across image, graph, and text data will help researchers improve deep learning optimizers, and practitioners choose the right optimizer for the problem at hand.
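To make the combination concrete, below is a minimal PyTorch sketch, written for illustration only and not the implementation evaluated in this work, of wrapping SWA's weight averaging around SAM's two-step ascent/descent update. The model, data loader, learning rate, and the SAM radius `rho` are placeholder assumptions.

```python
import torch
from torch.optim.swa_utils import AveragedModel, update_bn

# Placeholder model and data; any architecture/dataset would work (assumption).
model = torch.nn.Linear(10, 2)
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(4)]
criterion = torch.nn.CrossEntropyLoss()
base_opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
swa_model = AveragedModel(model)   # maintains the running average of weights (SWA)
rho = 0.05                         # SAM neighborhood radius (hypothetical value)

for epoch in range(2):
    for x, y in loader:
        # SAM step 1: gradient at the current weights, then ascend to the
        # worst-case point within an L2 ball of radius rho.
        criterion(model(x), y).backward()
        with torch.no_grad():
            grad_norm = torch.norm(torch.stack(
                [p.grad.norm() for p in model.parameters()]))
            eps = [rho * p.grad / (grad_norm + 1e-12) for p in model.parameters()]
            for p, e in zip(model.parameters(), eps):
                p.add_(e)
        base_opt.zero_grad()

        # SAM step 2: gradient at the perturbed weights, applied to the
        # original (restored) weights.
        criterion(model(x), y).backward()
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.sub_(e)
        base_opt.step()
        base_opt.zero_grad()

    # SWA: fold the current weights into the running average once per epoch.
    swa_model.update_parameters(model)

# Recompute batch-norm statistics for the averaged weights before evaluation
# (a no-op here, since the placeholder model has no batch-norm layers).
update_bn(loader, swa_model)
```

In this sketch, SAM shapes each individual update toward flat regions, while SWA averages the resulting iterates, so the two mechanisms operate at different timescales and can be composed directly.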