The training success, training speed, and generalization ability of neural networks rely crucially on the choice of random parameter initialization. It has been shown for multiple architectures that initial dynamical isometry is particularly advantageous. Known initialization schemes for residual blocks, however, miss this property: they suffer from degrading separability of different inputs with increasing depth, from instability in the absence of Batch Normalization, or from a lack of feature diversity. We propose a random initialization scheme, RISOTTO, that achieves perfect dynamical isometry for residual networks with ReLU activation functions, even at finite depth and width. Unlike other schemes, which are initially biased towards the skip connections, it balances the contributions of the residual and skip branches. In experiments, we demonstrate that our approach outperforms initialization schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit, in most cases, and that it facilitates stable training. In combination with Batch Normalization, too, we find that RISOTTO often achieves the overall best result.
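To make the central notion concrete: dynamical isometry means that the singular values of the network's input-output Jacobian concentrate at 1 at initialization, so signals are neither amplified nor attenuated in any direction. The following is a minimal illustrative sketch, not RISOTTO's construction; it builds a plain residual ReLU block in PyTorch and inspects the Jacobian's singular values under a default initialization. Names such as `ResidualBlock` and `hidden_dim` are placeholders chosen for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 64  # arbitrary width for this sketch


class ResidualBlock(nn.Module):
    """Plain residual block x + W2 ReLU(W1 x), with no normalization layers."""

    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim, bias=False)
        self.fc2 = nn.Linear(dim, dim, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        return x + self.fc2(self.relu(self.fc1(x)))


block = ResidualBlock(hidden_dim)

# SkipInit-style baseline for comparison: zeroing the residual branch makes the
# block exactly the identity, so the Jacobian is trivially isometric, but the
# residual branch then contributes no features at initialization.
# nn.init.zeros_(block.fc2.weight)

x = torch.randn(hidden_dim, requires_grad=True)
jac = torch.autograd.functional.jacobian(block, x)  # (dim, dim) input-output Jacobian
singular_values = torch.linalg.svdvals(jac)
print(f"min sv: {singular_values.min().item():.3f}, "
      f"max sv: {singular_values.max().item():.3f}")
# Perfect dynamical isometry would place every singular value at exactly 1;
# the spread printed here quantifies how far a given initialization deviates.
```

Under a standard random initialization the printed spread widens as blocks are stacked, which is the depth-dependent degradation the abstract refers to; a scheme with perfect dynamical isometry keeps all singular values at 1 for any finite depth and width.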