The success of deep learning has revealed the application potential of neural networks across the sciences and opened up fundamental theoretical problems. In particular, the fact that learning algorithms based on simple variants of gradient methods are able to find near-optimal minima of highly nonconvex loss functions is an unexpected feature of neural networks. Moreover, such algorithms are able to fit the data even in the presence of noise, and yet they have excellent predictive capabilities. Several empirical results have shown a reproducible correlation between the so-called flatness of the minima achieved by the algorithms and the generalization performance. At the same time, statistical physics results have shown that in nonconvex networks a multitude of narrow minima may coexist with a much smaller number of wide flat minima, which generalize well. Here we show that wide flat minima arise as complex extensive structures, from the coalescence of minima around "high-margin" (i.e., locally robust) configurations. Despite being exponentially rare compared to zero-margin ones, high-margin minima tend to concentrate in particular regions. These minima are in turn surrounded by other solutions of smaller and smaller margin, leading to dense regions of solutions over long distances. Our analysis also provides an alternative analytical method for estimating when flat minima appear and when algorithms begin to find solutions, as the number of model parameters varies.
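To make the notions of "margin" and local flatness used above concrete, here is a minimal illustrative sketch, not the paper's analytical method (which relies on statistical-physics calculations): for a toy binary perceptron it computes per-pattern margins and probes flatness by checking how often small weight perturbations leave all patterns correctly classified. All variable names and sizes are assumptions for illustration only.

```python
# Minimal sketch (illustrative, not the paper's method): for a binary
# perceptron with weights w in {-1, +1}^N and random +-1 patterns,
# compute pattern margins and probe local flatness by flipping a few
# weights and counting how often the perturbed configuration still
# classifies every retained pattern correctly.

import numpy as np

rng = np.random.default_rng(0)

N, P = 101, 60                        # weights and patterns (illustrative sizes)
X = rng.choice([-1, 1], size=(P, N))  # random inputs
y = rng.choice([-1, 1], size=P)       # random labels

# A "solution" would normally come from a learning algorithm; here we take
# a random w and keep only the patterns it already classifies correctly,
# so the margin and flatness probes below are well defined in this toy.
w = rng.choice([-1, 1], size=N)
mask = y * (X @ w) > 0
X, y = X[mask], y[mask]

# Margin of pattern mu: kappa_mu = y_mu * (w . x_mu) / sqrt(N).
# "High-margin" configurations are those whose smallest margin is large.
margins = y * (X @ w) / np.sqrt(N)
print("smallest margin:", margins.min())

# Local flatness probe: flip k random weights and measure the fraction of
# perturbations that remain solutions (zero errors on the retained patterns).
def still_solution_fraction(w, k, trials=1000):
    hits = 0
    for _ in range(trials):
        w_pert = w.copy()
        idx = rng.choice(N, size=k, replace=False)
        w_pert[idx] *= -1
        if np.all(y * (X @ w_pert) > 0):
            hits += 1
    return hits / trials

for k in (1, 3, 5):
    print(f"fraction of {k}-flip perturbations that stay solutions:",
          still_solution_fraction(w, k))
```

In this picture, a configuration sitting in a wide flat region keeps a high surviving fraction even as the number of flipped weights grows, whereas an isolated narrow minimum loses its solutions after only a few flips.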