Modern deep learning models are over-parameterized, and different optima of the training loss can yield widely varying generalization performance. To address this, Sharpness-Aware Minimization (SAM) modifies the underlying loss function to guide descent methods towards flatter minima, which arguably generalize better. In this paper, we focus on a variant of SAM known as micro-batch SAM (mSAM), which, during training, averages the updates generated by adversarial perturbations across several disjoint shards (micro-batches) of a mini-batch. We extend a recently developed and well-studied general framework for flatness analysis to show theoretically that distributed gradient computation for sharpness-aware minimization achieves even flatter minima. To support this theoretical claim, we provide a thorough empirical evaluation on a variety of image classification and natural language processing tasks. We also show that, contrary to previous work, mSAM can be implemented in a flexible and parallelizable manner without significantly increasing computational costs. Our practical implementation of mSAM yields superior generalization performance over SAM across a wide range of tasks, further supporting our theoretical framework.
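To make the micro-batch averaging concrete, the following is a minimal sketch of one mSAM update in PyTorch. It is not the reference implementation; the model, loss_fn, base_optimizer, perturbation radius rho, and the number of micro-batches m are illustrative assumptions.

```python
# Minimal mSAM training-step sketch (illustrative, not the authors' code).
import torch

def msam_step(model, loss_fn, base_optimizer, x, y, m=4, rho=0.05):
    """One mSAM update: average SAM gradients over m disjoint micro-batches."""
    micro_xs, micro_ys = x.chunk(m), y.chunk(m)
    avg_grads = [torch.zeros_like(p) for p in model.parameters()]

    for xb, yb in zip(micro_xs, micro_ys):
        # 1) Gradient at the current weights on this micro-batch.
        model.zero_grad()
        loss_fn(model(xb), yb).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]

        # 2) Ascend to the adversarially perturbed point w + eps,
        #    with eps proportional to the normalized micro-batch gradient.
        grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        eps = [rho * g / (grad_norm + 1e-12) for g in grads]
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.add_(e)

        # 3) Gradient at the perturbed weights on the same micro-batch.
        model.zero_grad()
        loss_fn(model(xb), yb).backward()
        with torch.no_grad():
            for p, e, acc in zip(model.parameters(), eps, avg_grads):
                acc.add_(p.grad / m)   # accumulate this micro-batch's SAM gradient
                p.sub_(e)              # restore the original weights

    # 4) Apply the averaged update with the base optimizer.
    for p, acc in zip(model.parameters(), avg_grads):
        p.grad = acc
    base_optimizer.step()
```

Because each micro-batch contributes an independent perturbed-gradient computation, the loop over micro-batches can in principle be distributed across devices and the resulting gradients averaged, which is the flexibility and parallelizability referred to above.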