In recent years, a critical initialization scheme based on orthogonal weights has been proposed for deep nonlinear networks. Orthogonal weights are crucial for achieving dynamical isometry in random networks, where the entire spectrum of singular values of the input-output Jacobian concentrates around one. Strong empirical evidence that orthogonal initialization speeds up training relative to Gaussian initialization, in linear networks and in the linear regime of nonlinear networks, has attracted great interest. One recent work has proven the benefit of orthogonal initialization in linear networks. However, the dynamics behind it have not been revealed for nonlinear networks. In this work, we study the Neural Tangent Kernel (NTK), which describes gradient-descent training of wide networks, for orthogonally initialized, wide, fully-connected, nonlinear networks. We prove that the NTKs of Gaussian and orthogonal weights are equal when the network width is infinite, which implies that the training speedup from orthogonal initialization is a finite-width effect in the small-learning-rate regime. We then show that during training, the NTK of an infinite-width network with orthogonal initialization stays constant in theory and, empirically, varies at a rate of the same order as that of Gaussian networks as the width tends to infinity. Finally, we conduct a thorough empirical investigation of training speed on the CIFAR-10 dataset and show that the benefit of orthogonal initialization lies in the large learning rate and large depth regime, within the linear regime of nonlinear networks.
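To make the notion of dynamical isometry concrete, the following is a minimal NumPy sketch (not part of the paper; all names are illustrative) that contrasts the input-output Jacobian spectrum of a deep linear network under orthogonal versus Gaussian initialization. For a linear network the Jacobian is just the product of the weight matrices, so a product of orthogonal matrices has every singular value exactly one, whereas the Gaussian spectrum spreads out with depth.

```python
import numpy as np

def orthogonal(n, rng):
    # Haar-random orthogonal matrix via QR decomposition of a Gaussian matrix,
    # with signs fixed so the distribution is uniform over the orthogonal group.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
width, depth = 512, 10

# Deep linear network with orthogonal weights: the input-output Jacobian is
# the product of the weight matrices, itself orthogonal.
jac_orth = np.eye(width)
for _ in range(depth):
    jac_orth = orthogonal(width, rng) @ jac_orth
s_orth = np.linalg.svd(jac_orth, compute_uv=False)
print("orthogonal:", s_orth.min(), s_orth.max())  # all singular values ~ 1

# Gaussian weights with variance 1/width keep the mean squared singular value
# near one, but the spectrum spreads out with depth: no dynamical isometry.
jac_gauss = np.eye(width)
for _ in range(depth):
    jac_gauss = (rng.standard_normal((width, width)) / np.sqrt(width)) @ jac_gauss
s_gauss = np.linalg.svd(jac_gauss, compute_uv=False)
print("gaussian:  ", s_gauss.min(), s_gauss.max())
```

In the nonlinear case this isometry holds only approximately, in the linear regime near criticality, which is the setting the empirical results above refer to.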