For training neural networks, flat-minima optimizers that seek parameters in neighborhoods with uniformly low loss (flat minima) have been shown to improve upon stochastic and adaptive gradient-based methods. Two methods for finding flat minima stand out: 1. Averaging methods (i.e., Stochastic Weight Averaging, SWA), and 2. Minimax methods (i.e., Sharpness Aware Minimization, SAM). However, despite similar motivations, there has been limited investigation into their properties and no comprehensive comparison between them. In this work, we investigate the loss surfaces obtained from a systematic benchmarking of these approaches across computer vision, natural language processing, and graph learning tasks. The results lead to a simple hypothesis: since the two approaches find different flat solutions, combining them should improve generalization even further. We verify that the combination improves over either flat-minima approach alone in 39 out of 42 cases. When it does not, we investigate potential reasons. We hope our results across image, graph, and text data will help researchers to improve deep learning optimizers, and practitioners to pinpoint the optimizer for the problem at hand.
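As a concrete illustration of the combination studied here, below is a minimal PyTorch sketch of applying SWA-style weight averaging on top of SAM-style sharpness-aware updates. The toy model, data, and hyperparameters (rho, swa_start, learning rate) are illustrative assumptions, not the experimental setup from this work.

```python
# Minimal sketch (assumed setup, not the paper's): SWA averaging of SAM iterates.
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

torch.manual_seed(0)

# Toy regression data and model, purely for illustration.
X, y = torch.randn(256, 10), torch.randn(256, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()
base_opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

swa_model = AveragedModel(model)          # running average of weights (SWA)
rho, swa_start, epochs = 0.05, 50, 100    # assumed hyperparameters


def sam_step(loss_fn, rho):
    """One SAM update: ascend to a nearby worst-case point, then descend."""
    base_opt.zero_grad()
    loss_fn().backward()                  # gradient at the current weights
    with torch.no_grad():
        grads = [p.grad for p in model.parameters()]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
        eps = [rho * g / grad_norm for g in grads]
        for p, e in zip(model.parameters(), eps):
            p.add_(e)                     # perturb toward higher loss
    base_opt.zero_grad()
    loss_fn().backward()                  # gradient at the perturbed point
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                     # restore the original weights
    base_opt.step()                       # descend with the sharpness-aware gradient


for epoch in range(epochs):
    sam_step(lambda: criterion(model(X), y), rho)
    if epoch >= swa_start:
        swa_model.update_parameters(model)  # average the SAM iterates (SWA-on-SAM)

print("final averaged-model loss:", criterion(swa_model(X), y).item())
```

In a realistic training run one would also recompute BatchNorm statistics for the averaged model (e.g., via torch.optim.swa_utils.update_bn) and typically begin averaging only after the base run has roughly converged; the sketch omits these details for brevity.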