The gradient noise of Stochastic Gradient Descent (SGD) is believed to play a key role in its properties (e.g. escaping low potential points and regularization). Past research has indicated that the covariance of the noise introduced by minibatching is critical in determining both its regularization and its escape from low potential points. How much the distribution of this noise, beyond its covariance, influences the behavior of the algorithm has, however, been little explored. Motivated by recent research in this area, we prove universality results showing that noise classes with the same mean and covariance structure as minibatch SGD have similar properties. We mainly consider the Multiplicative Stochastic Gradient Descent (M-SGD) algorithm introduced by Wu et al., which admits a much more general noise class than minibatch SGD. We establish nonasymptotic bounds for the M-SGD algorithm, mainly with respect to the Stochastic Differential Equation corresponding to minibatch SGD. We also show that the M-SGD error is approximately distributed as a scaled Gaussian with mean $0$ at any fixed point of the M-SGD algorithm. Finally, we establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
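To make the notion of "same mean and covariance structure" concrete, the following is a minimal sketch, not the construction used in the paper. It assumes that an M-SGD step takes the form $\theta \leftarrow \theta - \eta \sum_i w_i \nabla \ell_i(\theta)$ with a random weight vector $w$, and that minibatch SGD without replacement corresponds to weights $1/b$ on the sampled batch and $0$ elsewhere. The Gaussian weights are built to match the mean $1/n$ and the covariance $c\,(I - \mathbf{1}\mathbf{1}^\top/n)$, $c = (n-b)/(bn(n-1))$, of those minibatch weights. The least-squares loss, the noiseless labels, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def minibatch_weights(rng, n, b):
    """Minibatch SGD without replacement: weight 1/b on the batch, 0 elsewhere."""
    w = np.zeros(n)
    w[rng.choice(n, size=b, replace=False)] = 1.0 / b
    return w


def gaussian_weights(rng, n, b):
    """Gaussian weights with the same mean (1/n) and covariance as minibatch_weights."""
    c = (n - b) / (b * n * (n - 1))  # covariance is c * (I - 11^T / n)
    z = rng.standard_normal(n)
    return 1.0 / n + np.sqrt(c) * (z - z.mean())


def msgd(X, y, sample_weights, b, lr=0.1, steps=500, seed=0):
    """Iterate theta <- theta - lr * sum_i w_i * grad l_i(theta) for squared loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        residual = X @ theta - y                   # shape (n,)
        per_sample_grads = residual[:, None] * X   # row i: grad of 0.5 * (x_i @ theta - y_i)**2
        w = sample_weights(rng, n, b)              # fresh random weights at every step
        theta = theta - lr * (per_sample_grads.T @ w)
    return theta


# Hypothetical toy problem: noiseless linear regression.
data_rng = np.random.default_rng(1)
n, d, b = 64, 3, 8
X = data_rng.standard_normal((n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star

for name, sampler in [("minibatch", minibatch_weights), ("gaussian", gaussian_weights)]:
    theta = msgd(X, y, sampler, b)
    print(name, np.linalg.norm(theta - theta_star))
```

Both weight schemes sum to one and share their first two moments, so under these assumptions the two runs differ only in the higher-order structure of the noise. Because the labels are noiseless, every per-sample gradient vanishes at `theta_star`, and the multiplicative noise vanishes with it.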