Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in poorly generalizing local minima. Here we investigate the crossover between these two regimes in the high-dimensional setting, and in particular examine the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
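To make the setting concrete, below is a minimal sketch of the kind of high-dimensional online-SGD experiment the abstract describes: a shallow "soft committee machine" student trained on one fresh Gaussian sample per step to match a planted teacher, in the spirit of the Saad & Solla analysis. All concrete choices here (input dimension `d`, teacher width `M`, student width `K`, learning rate, step count, and the erf activation) are illustrative assumptions, not values or code from the paper.

```python
# One-pass (online) SGD for a shallow teacher-student setup on i.i.d. Gaussian
# data. Illustrative sketch only; parameters are assumptions, not paper values.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)

d, M, K = 500, 2, 8        # input dimension, teacher width, student width
lr, steps = 1.0, 50_000    # SGD learning rate and number of fresh samples

def g(z):                  # erf activation, as in the Saad & Solla line of work
    return erf(z / np.sqrt(2))

def g_prime(z):            # derivative of g
    return np.sqrt(2 / np.pi) * np.exp(-z ** 2 / 2)

W_teacher = rng.standard_normal((M, d))   # fixed (planted) teacher weights
W = rng.standard_normal((K, d))           # student weights, trained by SGD

def output(Wmat, x):
    # committee output: average activation of pre-activations w_k . x / sqrt(d)
    z = Wmat @ x / np.sqrt(d)
    return g(z).mean()

for t in range(steps):
    x = rng.standard_normal(d)            # one fresh Gaussian sample per step
    y = output(W_teacher, x)
    z = W @ x / np.sqrt(d)
    err = g(z).mean() - y                 # residual on this sample
    # squared-loss gradient for each hidden unit, then a plain SGD step
    W -= lr * err * (g_prime(z)[:, None] / K) * x[None, :] / np.sqrt(d)

# Monte Carlo estimate of the generalization (population) error
X_test = rng.standard_normal((2000, d))
preds = np.array([output(W, x) for x in X_test])
targets = np.array([output(W_teacher, x) for x in X_test])
print("estimated generalization error:", 0.5 * np.mean((preds - targets) ** 2))
```

Varying `K` relative to `M` and rescaling `lr` with `K` in this sketch is one way to probe, empirically, the narrow-network versus over-parametrized regimes that the abstract contrasts.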