Recent empirical work on stochastic gradient descent (SGD) applied to over-parameterized deep learning models has shown that most gradient components across epochs are quite small. Inspired by these observations, we rigorously study the properties of Truncated SGD (T-SGD), which truncates the majority of the small gradient components to zero. For non-convex optimization problems, we show that the convergence rate of T-SGD matches the order of vanilla SGD. We also establish a generalization error bound for T-SGD. Further, we propose Noisy Truncated SGD (NT-SGD), which adds Gaussian noise to the truncated gradients. We prove that NT-SGD has the same convergence rate as T-SGD for non-convex optimization problems. We demonstrate that, with the help of noise, NT-SGD can provably escape saddle points and requires less noise than previous related work. We also prove that, owing to the added noise, NT-SGD achieves a better generalization error bound than T-SGD. Our generalization analysis is based on uniform stability, and we show that the additional noise in the gradient update boosts stability. Our experiments on a variety of benchmark datasets (MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100) with various networks (VGG and ResNet) validate the theoretical properties of NT-SGD: it matches the speed and accuracy of vanilla SGD while working with sparse gradients, and it can successfully escape poor local minima.
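To make the update rule concrete, the following is a minimal NumPy sketch of a single NT-SGD step under assumed hyperparameters; the names `lr`, `keep_frac`, and `noise_std` are illustrative and do not follow the paper's notation. The stochastic gradient is first truncated to its largest-magnitude components (T-SGD), then isotropic Gaussian noise is added before the parameter update (NT-SGD).

```python
import numpy as np

def nt_sgd_step(w, grad, lr=0.1, keep_frac=0.1, noise_std=1e-3, rng=None):
    """One illustrative NT-SGD update: keep only the largest-magnitude
    gradient components, zero out the rest, add Gaussian noise, then step.

    Hypothetical parameterization (keep_frac, noise_std) for a sketch only;
    the paper specifies the truncation level and noise scale in its own terms.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = max(1, int(keep_frac * grad.size))            # number of components kept
    truncated = np.zeros_like(grad)
    top_idx = np.argpartition(np.abs(grad), -k)[-k:]  # indices of the k largest |g_i|
    truncated[top_idx] = grad[top_idx]                # T-SGD: small components set to zero
    noisy = truncated + noise_std * rng.standard_normal(grad.shape)  # NT-SGD: add noise
    return w - lr * noisy
```

Setting `noise_std = 0` recovers the T-SGD update, while additionally setting `keep_frac = 1` recovers vanilla SGD.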