Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that, with standard i.i.d. initializations, the only non-trivial dynamics is obtained for $\alpha_L = 1/\sqrt{L}$ (other choices lead either to explosion or to an identity mapping). This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. In the latter regime, by contrast, stability is obtained with specific correlated initializations and $\alpha_L = 1/L$. Our analysis suggests a strong interplay between scaling and the regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
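To make the role of the scaling factor concrete, the sketch below simulates a toy residual recursion $h_{k+1} = h_k + \alpha_L \, W_{k+1} h_k$ at initialization, with i.i.d. Gaussian weights of variance $1/d$, and compares $\alpha_L \in \{1/L, 1/\sqrt{L}, 1\}$. The linear residual branch, the dimension $d$, the variance choice, and the helper \texttt{norm\_ratio} are illustrative assumptions for this minimal example, not the exact construction analyzed in the paper.

\begin{verbatim}
import numpy as np

def norm_ratio(L, alpha, d=64, seed=0):
    # Toy residual recursion h_{k+1} = h_k + alpha * W_{k+1} h_k with
    # i.i.d. Gaussian weights of variance 1/d (illustrative assumption).
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(d)
    h0 = np.linalg.norm(h)
    for _ in range(L):
        W = rng.standard_normal((d, d)) / np.sqrt(d)
        h = h + alpha * (W @ h)
    # Growth of the hidden-state norm over L layers at initialization.
    return np.linalg.norm(h) / h0

L = 100
for label, alpha in [("1/L", 1 / L), ("1/sqrt(L)", 1 / np.sqrt(L)), ("1", 1.0)]:
    print(f"alpha_L = {label:>9}: ||h_L|| / ||h_0|| ~ {norm_ratio(L, alpha):.2e}")
\end{verbatim}

In this toy model, $\alpha_L = 1/L$ leaves the output norm essentially unchanged (close to an identity mapping), $\alpha_L = 1/\sqrt{L}$ keeps it of order one without collapsing to the identity, and $\alpha_L = 1$ makes it blow up with depth, mirroring the trichotomy stated above.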