This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in deep, randomly initialized neural networks. Leveraging an in-depth analysis of neural chains, we first show that vanishing gradients cannot be circumvented when the network width scales more slowly than O(depth), even under the popular Xavier and He initializations. Second, we extend the analysis to second-order derivatives and show that random i.i.d. initialization also gives rise to Hessian matrices whose eigenspectra vanish as the network grows in depth. Whenever this happens, the optimizer starts in a very flat, saddle-point-like plateau that is particularly hard to escape with stochastic gradient descent (SGD), whose escape time is inversely related to curvature. We believe this observation is crucial for fully understanding (a) the historical difficulties of training deep networks with vanilla SGD, (b) the success of adaptive gradient methods, which naturally adapt to curvature and thus quickly escape flat plateaus, and (c) the effectiveness of modern architectural components such as residual connections and normalization layers.
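The first claim, that gradients at the input vanish with depth under standard i.i.d. initializations when the width does not keep up, can be probed with a quick numerical check. The sketch below is not from the paper: the fixed width of 32, the ReLU nonlinearity, and the manual NumPy backward pass are illustrative assumptions. It backpropagates a random unit cotangent through a randomly initialized chain with Xavier/Glorot scaling and reports how the gradient norm at the input shrinks as depth grows.

```python
# Illustrative sketch (not the paper's experiment): gradient norm at the input
# of a deep, narrow, randomly initialized ReLU chain under Xavier/Glorot scaling.
import numpy as np

rng = np.random.default_rng(0)

def grad_norm_at_input(width: int, depth: int) -> float:
    """Backpropagate a random unit cotangent through `depth` ReLU layers of size `width`."""
    x = rng.standard_normal(width)
    weights, masks = [], []
    for _ in range(depth):
        # Xavier/Glorot scaling for a square layer: Var(W_ij) = 2 / (fan_in + fan_out).
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / (2 * width))
        pre = W @ x
        masks.append(pre > 0)          # ReLU derivative
        x = np.maximum(pre, 0.0)
        weights.append(W)
    # Backward pass: pull a random unit cotangent back to the input.
    g = rng.standard_normal(width)
    g /= np.linalg.norm(g)
    for W, m in zip(reversed(weights), reversed(masks)):
        g = W.T @ (g * m)
    return float(np.linalg.norm(g))

for depth in (4, 16, 64, 256):
    norms = [grad_norm_at_input(width=32, depth=depth) for _ in range(20)]
    print(f"depth={depth:4d}  median gradient norm at input ~ {np.median(norms):.3e}")
```

With the width held fixed at 32 while the depth grows, the reported norm drops by many orders of magnitude, which is the regime the abstract describes as the width scaling more slowly than the depth.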