Learning through atypical "phase transitions" in overparameterized neural networks

Current deep neural networks are highly overparameterized (up to billions of connection weights) and nonlinear. Yet they can fit data almost perfectly through variants of gradient descent algorithms and achieve unexpected levels of prediction accuracy without overfitting. These are formidable results that defy predictions of statistical learning and pose conceptual challenges for non-convex optimization. In this paper, we use methods from statistical physics of disordered systems to analytically study the computational fallout of overparameterization in non-convex binary neural network models, trained on data generated from a structurally simpler but "hidden" network. As the number of connection weights increases, we follow the changes in the geometrical structure of different minima of the error loss function and relate them to learning and generalization performance. A first transition happens at the so-called interpolation point, when solutions begin to exist (perfect fitting becomes possible). This transition reflects the properties of typical solutions, which however are in sharp minima and hard to sample. After a gap, a second transition occurs, with the discontinuous appearance of a different kind of "atypical" structure: wide regions of the weight space that are particularly solution-dense and have good generalization properties. The two kinds of solutions coexist, with the typical ones being exponentially more numerous, but empirically we find that efficient algorithms sample the atypical, rare ones. This suggests that the atypical phase transition is the relevant one for learning. The results of numerical tests with realistic networks on observables suggested by the theory are consistent with this scenario.
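As a rough illustration of the setting described above, the following Python sketch generates labels from a "hidden" binary network and estimates how solution-dense the neighbourhood of a given weight configuration is. It is a minimal sketch under strong assumptions and is not the model analyzed in the paper: the hidden network and the probed network are both taken to be single-layer binary perceptrons of the same size, the density estimate is a crude Monte Carlo count rather than the analytical quantity used in the theory, and all function names and parameters (`make_teacher_dataset`, `training_errors`, `local_solution_fraction`, `n_flips`, `n_samples`) are hypothetical.

```python
import random


def sign(x):
    """Sign with the convention sign(0) = +1."""
    return 1 if x >= 0 else -1


def make_teacher_dataset(n_inputs, n_patterns, seed=0):
    """Draw a random binary (+1/-1) teacher perceptron and label random patterns with it.

    Assumption: n_inputs is odd, so the pre-activation is never exactly zero.
    """
    rng = random.Random(seed)
    teacher = [rng.choice((-1, 1)) for _ in range(n_inputs)]
    patterns = [
        [rng.choice((-1, 1)) for _ in range(n_inputs)]
        for _ in range(n_patterns)
    ]
    labels = [sign(sum(w * x for w, x in zip(teacher, p))) for p in patterns]
    return teacher, patterns, labels


def training_errors(weights, patterns, labels):
    """Number of patterns that the binary perceptron `weights` misclassifies."""
    return sum(
        1
        for p, y in zip(patterns, labels)
        if sign(sum(w * x for w, x in zip(weights, p))) != y
    )


def local_solution_fraction(weights, patterns, labels, n_flips, n_samples, seed=0):
    """Monte Carlo estimate of the fraction of configurations at Hamming
    distance `n_flips` from `weights` that still fit every training pattern.

    A value near 1 suggests a wide, solution-dense region around `weights`;
    a value near 0 suggests an isolated (sharp) solution.
    """
    rng = random.Random(seed)
    n = len(weights)
    hits = 0
    for _ in range(n_samples):
        perturbed = list(weights)
        # Flip the sign of n_flips distinct, randomly chosen weights.
        for i in rng.sample(range(n), n_flips):
            perturbed[i] = -perturbed[i]
        if training_errors(perturbed, patterns, labels) == 0:
            hits += 1
    return hits / n_samples
```

For instance, calling `make_teacher_dataset(101, 50)` and then `local_solution_fraction(teacher, patterns, labels, n_flips=5, n_samples=200)` probes the neighbourhood of the teacher itself, which fits the data by construction. Comparing this fraction across solutions found by different algorithms is one simple way to contrast sharp and wide minima in the sense used above.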
