Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse compared to SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the "belief" in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves comparable accuracy to SGD. Furthermore, in the training of a GAN on CIFAR-10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. Code is available at https://github.com/juntang-zhuang/Adabelief-Optimizer
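The belief-based update described above can be sketched in a few lines. This is a minimal NumPy illustration of the idea, assuming the usual Adam-style EMA hyperparameters; the function name, defaults, and single-step interface are illustrative, not the paper's reference implementation. The key difference from Adam is that the second-moment EMA tracks the squared *deviation* of the gradient from its EMA prediction, rather than the squared gradient itself:

```python
import numpy as np

def adabelief_step(theta, g, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief-style update (illustrative sketch).

    m is the EMA of gradients, serving as the prediction of the next gradient;
    s is the EMA of the squared deviation (g - m)**2, i.e. the "belief" term.
    A large deviation -> large s -> small step; a small deviation -> large step.
    """
    m = beta1 * m + (1 - beta1) * g                   # prediction of the gradient
    s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps  # deviation from the prediction
    m_hat = m / (1 - beta1 ** t)                      # bias corrections, as in Adam
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s
```

Replacing `(g - m) ** 2` with `g ** 2` recovers the Adam denominator, which makes the one-line difference between the two update rules explicit.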