To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model radically changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks, more regimes are possible, and in this paper we study in detail a specific choice of "small" initialization corresponding to "mean-field" limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit, so that no learning occurs. We then propose various methods to avoid this trivial behavior and analyze the resulting dynamics in detail. In particular, one of these methods consists in using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization $\mu$P. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior between various choices of activation functions that is not yet captured by theory.