It is well known that stochastic gradient noise (SGN) acts as implicit regularization in deep learning and is essential for both the optimization and the generalization of deep networks. Some works have attempted to simulate SGN artificially by injecting random noise into training. However, injected simple random noise cannot match SGN, which is anisotropic and parameter-dependent. To simulate SGN at low computational cost, and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach, a powerful alternative to conventional Momentum in classic optimizers. The proposed PNM method maintains two approximately independent momentum terms, so the magnitude of SGN can be controlled explicitly by adjusting the difference between them. We theoretically prove a convergence guarantee and a generalization advantage of PNM over Stochastic Gradient Descent (SGD). By incorporating PNM into two conventional optimizers, SGD with Momentum and Adam, our extensive experiments empirically verify the significant advantage of the PNM-based variants over the corresponding Momentum-based optimizers. Code: \url{https://github.com/zeke-xie/Positive-Negative-Momentum}.
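The idea of maintaining two approximately independent momentum terms and combining them with positive and negative weights can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the buffer-alternation scheme, the normalization constant, and the hyperparameter names (`beta0` for the positive-negative weight, `beta1` for the momentum coefficient) are assumptions made for this sketch.

```python
import numpy as np

def pnm_sgd(grad_fn, theta, lr=0.1, beta0=1.0, beta1=0.9, steps=100):
    """Sketch of a Positive-Negative Momentum update.

    Two momentum buffers are updated on alternating steps, so each one
    accumulates gradients from a disjoint (hence approximately
    independent) subsequence of mini-batches. The parameter update
    combines them with a positive weight (1 + beta0) and a negative
    weight -beta0; larger beta0 amplifies the simulated gradient noise.
    """
    m = [np.zeros_like(theta), np.zeros_like(theta)]
    for t in range(steps):
        g = grad_fn(theta)
        k = t % 2  # which buffer gets this step's gradient
        m[k] = beta1**2 * m[k] + (1 - beta1**2) * g
        combined = (1 + beta0) * m[k] - beta0 * m[1 - k]
        # normalize so the effective step size stays comparable across beta0
        theta = theta - lr / np.sqrt((1 + beta0)**2 + beta0**2) * combined
    return theta
```

For example, on a noisy quadratic objective the iterate still converges toward the minimizer, while `beta0` tunes how much of the mini-batch noise survives in the combined momentum:

```python
rng = np.random.default_rng(0)
grad = lambda th: 2 * th + 0.01 * rng.standard_normal(th.shape)
theta = pnm_sgd(grad, np.ones(3), lr=0.05, beta0=1.0, steps=200)
```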