We study the stability and convergence of training deep ResNets with gradient descent. Specifically, we show that the parametric branch in the residual block should be scaled down by a factor $\tau = O(1/\sqrt{L})$ to guarantee a stable forward/backward process, where $L$ is the number of residual blocks. Moreover, we establish a converse result: the forward process is unbounded when $\tau > L^{-\frac{1}{2}+c}$ for any positive constant $c$. These two results together establish a sharp value of the scaling factor for the stability of deep ResNets. Based on the stability result, we further show that gradient descent finds the global minima if the ResNet is properly over-parameterized, which significantly improves over previous work by admitting a much larger range of $\tau$ for which global convergence holds. Moreover, we show that the convergence rate is independent of the depth, theoretically justifying the advantage of ResNets over vanilla feedforward networks. Empirically, with such a factor $\tau$, one can train deep ResNets without normalization layers. Moreover, for ResNets with normalization layers, adding such a factor $\tau$ also stabilizes training and yields significant performance gains for deep ResNets.
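A minimal sketch of the scaled residual architecture described above, assuming a fully connected pre-activation block with ReLU (the abstract does not specify the branch structure); the class names and hyperparameters are illustrative, and only the scaling $\tau = 1/\sqrt{L}$ on the parametric branch is taken from the text.

```python
import math
import torch
import torch.nn as nn


class ScaledResidualBlock(nn.Module):
    """Residual block whose parametric branch is scaled by tau = 1/sqrt(L)."""

    def __init__(self, width: int, num_blocks: int):
        super().__init__()
        # Scaling factor tau = O(1/sqrt(L)) from the stability analysis.
        self.tau = 1.0 / math.sqrt(num_blocks)
        # Hypothetical two-layer branch; the paper's block structure may differ.
        self.branch = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )

    def forward(self, x):
        # x_{l+1} = x_l + tau * f(x_l); the identity path is left unscaled.
        return x + self.tau * self.branch(x)


class ScaledResNet(nn.Module):
    """A ResNet with L scaled residual blocks and no normalization layers."""

    def __init__(self, in_dim: int, width: int, num_blocks: int, num_classes: int):
        super().__init__()
        self.embed = nn.Linear(in_dim, width)
        self.blocks = nn.Sequential(
            *[ScaledResidualBlock(width, num_blocks) for _ in range(num_blocks)]
        )
        self.head = nn.Linear(width, num_classes)

    def forward(self, x):
        return self.head(self.blocks(self.embed(x)))


# Usage: a 100-block network, so tau = 1/sqrt(100) = 0.1; the claim is that
# with this scaling the forward/backward passes remain bounded even without
# normalization layers.
model = ScaledResNet(in_dim=784, width=256, num_blocks=100, num_classes=10)
out = model(torch.randn(32, 784))
```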