To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and have led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model radically changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks, more regimes are possible, and in this paper we study in detail a specific choice of "small" initialization corresponding to "mean-field" limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit and no learning occurs. We then propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics. In particular, one of these methods consists in using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization $\mu$P. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior between various choices of activation functions that is not yet captured by theory.
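As a minimal illustration of the role of scaling (using standard notation for the two-layer case, not necessarily the notation of the paper), write the output of a width-$m$ network with i.i.d., order-one weights $(a_j, w_j)$ as
$$
f_{\mathrm{NTK}}(x) \;=\; \frac{1}{\sqrt{m}} \sum_{j=1}^{m} a_j\, \sigma(\langle w_j, x\rangle),
\qquad
f_{\mathrm{MF}}(x) \;=\; \frac{1}{m} \sum_{j=1}^{m} a_j\, \sigma(\langle w_j, x\rangle).
$$
As $m \to \infty$, the $1/\sqrt{m}$ scaling (equivalently, a large effective initialization) yields the kernel, or "lazy", regime in which features barely move, whereas the $1/m$ mean-field scaling (a smaller effective initialization) yields the feature learning regime; integrable parameterizations extend this latter, mean-field-type scaling to networks with more than two layers.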