We investigate the convergence and convergence rate of stochastic training algorithms for Neural Networks (NNs) that have been inspired by Dropout (Hinton et al., 2012). With the goal of avoiding overfitting during training of NNs, dropout algorithms in practice consist of multiplying the weight matrices of an NN componentwise by independently drawn random matrices with $\{0, 1\}$-valued entries during each iteration of Stochastic Gradient Descent (SGD). This paper presents a probability-theoretic proof that for fully-connected NNs with differentiable, polynomially bounded activation functions, if we project the weights onto a compact set when using a dropout algorithm, then the weights of the NN converge to a unique stationary point of a projected system of Ordinary Differential Equations (ODEs). After this general convergence guarantee, we go on to investigate the convergence rate of dropout. Firstly, we obtain generic sample complexity bounds for finding $\epsilon$-stationary points of smooth nonconvex functions using SGD with dropout that explicitly depend on the dropout probability. Secondly, we obtain an upper bound on the rate of convergence of Gradient Descent (GD) on the limiting ODEs of dropout algorithms for NNs with the shape of arborescences of arbitrary depth and with linear activation functions. The latter bound shows that for an algorithm such as Dropout or Dropconnect (Wan et al., 2013), the convergence rate can be impaired exponentially by the depth of the arborescence. In contrast, we experimentally observe no such dependence for wide NNs with just a few dropout layers, and we provide a heuristic argument for this observation. Our results suggest that the effect of the dropout probability on the convergence rate changes scale depending on the relative size of the width of the NN compared to its depth.
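As a minimal sketch (not the paper's code) of the iteration described above, the following Python snippet performs one projected SGD step in which each weight matrix is multiplied componentwise by an independently drawn $\{0,1\}$-valued mask and the result is projected back onto a compact set; the names `loss_grad`, `keep_prob`, and `radius` are illustrative assumptions.

```python
import numpy as np

def dropout_sgd_step(weights, loss_grad, keep_prob=0.8, lr=1e-2, radius=10.0):
    """One projected SGD step with Bernoulli masking of the weights.

    weights   : list of np.ndarray, the NN weight matrices
    loss_grad : callable taking the masked weights and returning gradients
                of the same shapes (e.g. via backpropagation); assumed given
    keep_prob : probability that an individual entry is kept (not zeroed)
    radius    : projection onto the compact set {W : ||W||_F <= radius}
    """
    # Draw independent {0,1}-valued masks and apply them componentwise.
    masks = [np.random.binomial(1, keep_prob, size=W.shape) for W in weights]
    masked = [M * W for M, W in zip(masks, weights)]

    # Stochastic gradient step computed on the masked network.
    grads = loss_grad(masked)
    new_weights = [W - lr * g for W, g in zip(weights, grads)]

    # Project each matrix back onto a compact (Frobenius-norm) ball.
    projected = []
    for W in new_weights:
        norm = np.linalg.norm(W)
        projected.append(W if norm <= radius else W * (radius / norm))
    return projected
```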