The success of deep learning has revealed the application potential of neural networks across the sciences and opened up fundamental theoretical problems. In particular, the fact that learning algorithms based on simple variants of gradient methods are able to find near-optimal minima of highly nonconvex loss functions is an unexpected feature of neural networks that needs to be understood in depth. Such algorithms are able to fit the data almost perfectly, even in the presence of noise, and yet they have excellent predictive capabilities. Several empirical results have shown a reproducible correlation between the so-called flatness of the minima reached by the algorithms and their generalization performance. At the same time, statistical-physics results have shown that in nonconvex networks a multitude of narrow minima may coexist with a much smaller number of wide flat minima, which generalize well. Here we show that wide flat minima arise from the coalescence of minima that correspond to high-margin classifications. Despite being exponentially rare compared with zero-margin solutions, high-margin minima tend to concentrate in particular regions. These minima are in turn surrounded by other solutions of smaller and smaller margin, leading to dense regions of solutions over long distances. Our analysis also provides an alternative analytical method for estimating when flat minima appear and when algorithms begin to find solutions, as the number of model parameters varies.
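The abstract does not fix a specific model, but the margin-flatness connection it describes can be illustrated in a minimal setting: a spherical perceptron storing random binary patterns. The sketch below is only an illustration under assumed choices (the sizes N and P, the minover-style training loop, and the perturbation-based flatness proxy are assumptions made here, not the paper's method). It finds a zero-margin and a higher-margin solution and estimates flatness as the probability that a random weight perturbation leaves the training classification unchanged.

# Illustrative sketch (assumed setup, not the paper's method): compare the
# robustness of a zero-margin and a higher-margin perceptron solution.
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 120                           # weights and patterns (load alpha = P/N = 0.6)
X = rng.choice([-1.0, 1.0], size=(P, N))  # random binary patterns
y = rng.choice([-1.0, 1.0], size=P)       # random labels (storage problem)

def train(margin, max_updates=20000):
    # Minover-style updates: repeatedly reinforce the least-stable pattern
    # until every stability y_mu * (w . x_mu) / sqrt(N) exceeds `margin`.
    w = rng.normal(size=N)
    w *= np.sqrt(N) / np.linalg.norm(w)   # spherical constraint |w|^2 = N
    for _ in range(max_updates):
        stab = y * (X @ w) / np.sqrt(N)
        i = int(np.argmin(stab))
        if stab[i] >= margin:
            break
        w += y[i] * X[i] / np.sqrt(N)
        w *= np.sqrt(N) / np.linalg.norm(w)
    return w

def flatness(w, noise=0.2, trials=500):
    # Crude flatness proxy: fraction of random weight perturbations that
    # still classify every training pattern correctly.
    ok = 0
    for _ in range(trials):
        wp = w + noise * rng.normal(size=N)
        wp *= np.sqrt(N) / np.linalg.norm(wp)
        ok += bool(np.all(y * (X @ wp) > 0))
    return ok / trials

w0 = train(margin=0.0)   # a zero-margin solution
w1 = train(margin=0.5)   # a higher-margin solution
print("flatness proxy, zero margin:", flatness(w0))
print("flatness proxy, margin 0.5 :", flatness(w1))

In this toy setting the higher-margin solution tolerates a much larger fraction of random perturbations, which is a simple numerical counterpart of the margin-flatness correlation discussed in the abstract.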