The prevailing view is that orthogonal weight initialization is crucial for enforcing dynamical isometry and speeding up training. The increase in learning speed afforded by orthogonal initialization in linear networks has been rigorously established. However, while the same is believed to hold for nonlinear networks when the dynamical isometry condition is satisfied, the training dynamics behind this contention have not been thoroughly explored. In this work, we study the dynamics of ultra-wide networks across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs) with orthogonal initialization, through the lens of the neural tangent kernel (NTK). Through a series of propositions and lemmas, we prove that the two NTKs, one corresponding to Gaussian weights and one to orthogonal weights, are equal when the network width is infinite. Furthermore, during training the NTK of an orthogonally initialized infinite-width network remains constant in theory. This implies that orthogonal initialization cannot speed up training in the NTK (lazy training) regime, contrary to the prevailing belief. To explore under what circumstances orthogonality can accelerate training, we conduct a thorough empirical investigation outside the NTK regime. We find that when the hyper-parameters are set so that the nonlinear activations operate in their linear regime, orthogonal initialization can improve learning speed with a large learning rate or large depth.
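As an informal illustration of the infinite-width claim (not the paper's actual proof or experimental setup), the following NumPy sketch compares the empirical NTK of a single-hidden-layer tanh network under a Gaussian versus an orthogonal initialization in the NTK parameterization. The width, scaling conventions, and function names are illustrative assumptions; at large width the two kernel values should nearly coincide.

```python
import numpy as np

def orthogonal_init(n, d, rng):
    """Random (n, d) matrix, n >= d, with orthonormal columns rescaled so that
    each entry has variance ~1, matching the Gaussian N(0, 1) initialization."""
    q, _ = np.linalg.qr(rng.normal(size=(n, d)))   # q: (n, d), orthonormal columns
    return q * np.sqrt(n)

def empirical_ntk(x1, x2, w, a):
    """Empirical NTK entry for f(x) = a @ tanh(w @ x / sqrt(d)) / sqrt(n)
    (NTK parameterization), computed from the analytic parameter gradients."""
    n, d = w.shape
    h1, h2 = w @ x1 / np.sqrt(d), w @ x2 / np.sqrt(d)
    t1, t2 = np.tanh(h1), np.tanh(h2)
    dt1, dt2 = 1.0 - t1**2, 1.0 - t2**2            # tanh'
    out_layer = t1 @ t2 / n                        # gradients w.r.t. output weights a
    hid_layer = (x1 @ x2 / d) * np.sum(a**2 * dt1 * dt2) / n  # gradients w.r.t. w
    return out_layer + hid_layer

rng = np.random.default_rng(0)
d, n = 8, 16384                                    # input dim, hidden width (very wide)
x1, x2 = rng.normal(size=d), rng.normal(size=d)

# Gaussian initialization: i.i.d. N(0, 1) entries.
w_gauss, a_gauss = rng.normal(size=(n, d)), rng.normal(size=n)

# Orthogonal initialization: semi-orthogonal hidden weights; for the output
# vector we use a random direction rescaled to the same norm as the Gaussian case.
w_orth = orthogonal_init(n, d, rng)
v = rng.normal(size=n)
a_orth = v / np.linalg.norm(v) * np.sqrt(n)

print("empirical NTK, Gaussian init:  ", empirical_ntk(x1, x2, w_gauss, a_gauss))
print("empirical NTK, orthogonal init:", empirical_ntk(x1, x2, w_orth, a_orth))
```

As the width n grows, both printed kernel values concentrate around the same limit, consistent with the equality of the two infinite-width NTKs stated above.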