In this paper, we characterize the noise of stochastic gradients and analyze the noise-induced dynamics that arise when deep neural networks are trained with gradient-based optimizers. Specifically, we first show that the stochastic gradient noise has finite variance, so the classical Central Limit Theorem (CLT) applies; the gradient noise is therefore asymptotically Gaussian. This asymptotic result validates the widely accepted assumption of Gaussian noise. We argue that the recently observed heavy tails in gradient noise may not be an intrinsic property but rather a consequence of insufficient mini-batch size: the gradient noise, being a sum of a limited number of i.i.d. random variables, has not yet reached the asymptotic regime of the CLT and thus deviates from Gaussian. We quantitatively measure the goodness of the Gaussian approximation to the noise, which supports this conclusion. Second, we analyze the noise-induced dynamics of stochastic gradient descent using the Langevin equation, giving the momentum hyperparameter of the optimizer a physical interpretation. We then demonstrate the existence of a steady-state distribution of stochastic gradient descent and approximate this distribution in the small-learning-rate regime.
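As an illustration of the batch-size argument above (not part of the paper's experiments), the following minimal sketch draws per-sample noise from a hypothetical skewed but finite-variance distribution and checks, via excess kurtosis and a Shapiro-Wilk test, how averaging over a mini-batch drives the noise toward Gaussianity; all names, distributions, and parameters here are illustrative assumptions.

    # Hypothetical sketch: mini-batch averaging of finite-variance per-sample
    # noise approaches Gaussianity as the batch size grows (CLT regime).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Skewed, heavy-ish tailed, but finite-variance stand-in for per-example gradient noise.
    population = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)
    population -= population.mean()  # zero-mean per-sample noise

    for batch_size in (1, 8, 64, 512):
        # Mini-batch gradient noise = average of batch_size i.i.d. per-sample terms.
        batches = rng.choice(population, size=(10_000, batch_size))
        noise = batches.mean(axis=1)
        # Excess kurtosis -> 0 and the Shapiro-Wilk p-value grows as Gaussianity is approached.
        kurt = stats.kurtosis(noise)
        _, p = stats.shapiro(noise[:500])  # test on a subsample
        print(f"batch={batch_size:4d}  excess kurtosis={kurt:6.2f}  Shapiro p={p:.3f}")

With a small batch the summary statistics reveal a clearly non-Gaussian distribution, while larger batches bring the excess kurtosis toward zero, mirroring the claim that observed heavy tails can stem from insufficient mini-batch size rather than from the noise itself.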
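For concreteness, one standard form of the underdamped Langevin equation commonly used to model SGD with momentum is sketched below, assuming the Gaussian-noise approximation discussed above; the specific coefficients and the exact correspondence to the momentum hyperparameter are stated only as the usual heuristic convention and may differ from the paper's derivation.

\[
d\theta_t = v_t\, dt, \qquad
dv_t = -\gamma v_t\, dt - \nabla L(\theta_t)\, dt + \sqrt{2\gamma T}\, dW_t ,
\]

where $L$ is the training loss, $v_t$ plays the role of the momentum (velocity) variable, the friction coefficient $\gamma$ is heuristically tied to the momentum hyperparameter $\beta$ via $1-\beta \approx \gamma\,\Delta t$, $T$ is a temperature-like factor set by the learning rate and noise covariance, and $W_t$ is a standard Wiener process modeling the asymptotically Gaussian gradient noise. Under these assumptions, when a steady state exists it is of Gibbs type, with a stationary marginal over $\theta$ proportional to $\exp(-L(\theta)/T)$ in the small-learning-rate limit.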