Sharpness of minima is a promising quantity that can correlate with generalization in deep networks and, when optimized during training, can improve generalization. However, standard sharpness is not invariant under reparametrizations of neural networks, and, to fix this, reparametrization-invariant sharpness definitions have been proposed, most prominently adaptive sharpness (Kwon et al., 2021). But does it really capture generalization in modern practical settings? We comprehensively explore this question in a detailed study of various definitions of adaptive sharpness in settings ranging from training from scratch on ImageNet and CIFAR-10 to fine-tuning CLIP on ImageNet and BERT on MNLI. We focus mostly on transformers, for which little is known in terms of sharpness despite their widespread use. Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters like the learning rate, which can be positively or negatively correlated with generalization depending on the setup. Interestingly, in multiple cases, we observe a consistent negative correlation of sharpness with out-of-distribution error, implying that sharper minima can generalize better. Finally, we illustrate on a simple model that the right sharpness measure is highly data-dependent, and that we do not understand this aspect well for realistic data distributions. The code of our experiments is available at https://github.com/tml-epfl/sharpness-vs-generalization.
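For reference, a minimal sketch of the worst-case adaptive sharpness definition that this question builds on (notation assumed here, following Kwon et al., 2021, rather than fixed by this abstract):

$$
S^{\max}_{\rho}(w, c) \;=\; \mathbb{E}_{S \sim \mathcal{D}} \left[ \max_{\|\delta \odot c^{-1}\|_p \le \rho} L_S(w + \delta) - L_S(w) \right],
$$

where $w$ denotes the network weights, $L_S$ the loss on a batch $S$, $\odot$ the elementwise product, and $c$ an elementwise scaling vector. Choosing $c = |w|$ makes the measure invariant to elementwise multiplicative reparametrizations that leave the network function unchanged, which is what distinguishes adaptive sharpness from standard sharpness (recovered with $c = (1, \ldots, 1)$).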