We develop a novel framework that adds the regularizers of the sparse group lasso to a family of adaptive optimizers in deep learning, such as Momentum, Adagrad, Adam, AMSGrad, and AdaHessian, yielding a new class of optimizers named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad, and Group AdaHessian, respectively. We prove convergence guarantees in the stochastic convex setting based on primal-dual methods. We evaluate the regularization effect of the new optimizers on three large-scale real-world ad-click datasets with state-of-the-art deep learning models. The experimental results show that, compared with the original optimizers followed by a post-processing procedure based on magnitude pruning, our optimizers significantly improve model performance at the same sparsity level. Furthermore, compared with the cases without magnitude pruning, our methods achieve extremely high sparsity with significantly better or highly competitive performance.
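To make the idea concrete, the sketch below shows one simplified way to combine a sparse group lasso penalty with an Adam-style update: a plain Adam step followed by the proximal map of the penalty (element-wise soft-thresholding for the l1 term, block soft-thresholding for the group term). This is only an illustrative proximal-gradient variant, not the primal-dual construction analyzed in the paper; the group layout and the hyperparameters `lr`, `lam1`, and `lam2` are illustrative assumptions.

```python
# Minimal sketch (assumption: a simple proximal-gradient variant, not the paper's
# primal-dual framework) of a "Group Adam"-style step with sparse group lasso.
import numpy as np

def sparse_group_lasso_prox(w, groups, step, lam1, lam2):
    """Prox of step * (lam1 * ||w||_1 + lam2 * sum_g sqrt(d_g) * ||w_g||_2)."""
    # Element-wise soft-thresholding for the l1 part.
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam1, 0.0)
    # Block soft-thresholding for the group lasso part.
    for g in groups:                              # g: index array of one group
        norm = np.linalg.norm(w[g])
        thr = step * lam2 * np.sqrt(len(g))
        w[g] = 0.0 if norm <= thr else (1.0 - thr / norm) * w[g]
    return w

def group_adam_step(w, grad, m, v, t, groups,
                    lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                    lam1=1e-4, lam2=1e-4):
    """One Adam update followed by the sparse-group-lasso proximal map."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)                     # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)                     # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # standard Adam step
    w = sparse_group_lasso_prox(w, groups, lr, lam1, lam2)
    return w, m, v
```

Because the group term zeroes out entire blocks of weights (e.g., all embedding weights of one feature), this kind of update yields structured sparsity directly during training rather than via post-hoc magnitude pruning.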