It is well known that stochastic gradient noise (SGN) acts as implicit regularization for deep learning and is essential for both the optimization and generalization of deep networks. Some works have attempted to simulate SGN artificially by injecting random noise into training. However, such simple injected noise cannot match SGN, which is anisotropic and parameter-dependent. To simulate SGN at low computational cost, and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach, a powerful alternative to conventional Momentum in classic optimizers. The PNM method maintains two approximately independent momentum terms, so the magnitude of SGN can be controlled explicitly by adjusting the momentum difference. We theoretically prove both a convergence guarantee and a generalization advantage of PNM over Stochastic Gradient Descent (SGD). By incorporating PNM into two conventional optimizers, SGD with Momentum and Adam, our extensive experiments empirically verify the significant advantage of the PNM-based variants over the corresponding conventional Momentum-based optimizers.
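The core idea above — two approximately independent momentum buffers whose weighted difference controls the injected noise magnitude — can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the buffer-alternation scheme, the `beta**2` decay, and the `(1 + beta0)` / `-beta0` combination weights are assumptions made here for clarity.

```python
def pnm_sgd_step(x, grad_fn, state, lr=0.1, beta=0.9, beta0=1.0):
    """One Positive-Negative Momentum (PNM) step; an illustrative sketch.

    Two momentum buffers are updated on alternating steps, so each one
    accumulates gradients from disjoint (hence approximately independent)
    mini-batches. The parameter update combines them with a positive
    weight (1 + beta0) and a negative weight (-beta0): beta0 sets the
    momentum difference and thus the magnitude of the SGN-like
    perturbation, without touching the learning rate or batch size.
    """
    g = grad_fn(x)                # stochastic gradient at the current point
    i = state["t"] % 2            # which buffer is active on this step
    # beta**2 keeps each buffer's effective decay (it is updated only
    # every other step) comparable to ordinary momentum with decay beta.
    state["m"][i] = beta**2 * state["m"][i] + (1 - beta**2) * g
    m_new, m_old = state["m"][i], state["m"][1 - i]
    state["t"] += 1
    return x - lr * ((1 + beta0) * m_new - beta0 * m_old)


# Toy check: minimize f(x) = x^2 with a deterministic gradient.
x, state = 5.0, {"t": 0, "m": [0.0, 0.0]}
for _ in range(1000):
    x = pnm_sgd_step(x, lambda v: 2 * v, state, lr=0.05)
```

With `beta0 = 0` the two buffers are simply averaged into an ordinary momentum-like update; increasing `beta0` amplifies the disagreement between the two buffers, which is the controllable noise source the abstract refers to.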