Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning. At each step of training, a mini-batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces stochastic dynamics into the gradient descent, with a non-trivial, state-dependent noise. We characterize the stochasticity of SGD and of a recently introduced variant, \emph{persistent} SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state, and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as a function of the problem parameters. In the over-parametrized regime, where the training error vanishes, we measure the noise magnitude of SGD by computing the average distance between two replicas of the system with the same initialization and two different realizations of the SGD noise. We find that the two noise measures behave similarly as a function of the problem parameters. Moreover, we observe that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
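The two-replica noise measure mentioned above can be illustrated with a minimal toy sketch: train two copies of a model from the same initialization, but with independently sampled mini-batches, and measure the distance between the final weight vectors. This is only an illustrative assumption of the protocol on a linear model with square loss; the paper's actual model, loss, and normalization differ, and all names here (`sgd_replica_distance`, the teacher-student data) are hypothetical.

```python
import numpy as np

def sgd_replica_distance(X, y, lr=0.1, batch_size=8, steps=500, seed_init=0):
    """Train two SGD replicas from the same initialization but with
    independent mini-batch sampling seeds (two realizations of the SGD
    noise), and return the normalized distance between their final
    weights. Toy linear model with square loss, for illustration only."""
    n, d = X.shape
    rng_init = np.random.default_rng(seed_init)
    w0 = rng_init.normal(size=d) / np.sqrt(d)  # shared initialization

    def run(seed):
        rng = np.random.default_rng(seed)  # noise realization = sampling seed
        w = w0.copy()
        for _ in range(steps):
            idx = rng.choice(n, size=batch_size, replace=False)
            # gradient of 0.5 * mean((X w - y)^2) over the mini-batch
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
            w -= lr * grad
        return w

    w_a, w_b = run(1), run(2)  # same init, two SGD noise realizations
    return np.linalg.norm(w_a - w_b) / np.sqrt(d)

# Synthetic teacher-student data (illustrative, not the paper's setup)
rng = np.random.default_rng(42)
d, n = 50, 200
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star / np.sqrt(d)
print(sgd_replica_distance(X, y))
```

In this sketch the mini-batch sampling seed is the only source of stochasticity, so any nonzero final distance is attributable entirely to the SGD noise, mirroring the measurement described in the abstract.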