Classical machine learning models such as deep neural networks are usually trained using Stochastic Gradient Descent (SGD)-based algorithms. Classical SGD can be interpreted as a discretization of the stochastic gradient flow. In this paper we propose a novel, robust, and accelerated stochastic optimizer that relies on two key elements: (1) an accelerated Nesterov-like Stochastic Differential Equation (SDE) and (2) its semi-implicit Gauss-Seidel type discretization. The convergence and stability of the resulting method, referred to as NAG-GS, are first studied extensively for the minimization of a quadratic function. This analysis allows us to derive an optimal step size (or learning rate) in terms of the rate of convergence while ensuring the stability of NAG-GS. This is achieved through a careful analysis of the spectral radius of the iteration matrix and of the covariance matrix at stationarity with respect to all hyperparameters of the method. We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for training machine learning models such as logistic regression, residual network models on standard computer vision datasets, and Transformers on the GLUE benchmark.
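To make the two ingredients named above concrete, the sketch below illustrates, under stated assumptions, what a semi-implicit (Gauss-Seidel type) step for a generic Nesterov-like SDE can look like: the position variable is updated implicitly first, and the momentum variable then reuses the freshly updated position and a stochastic gradient. The particular two-variable flow, the function name nag_gs_sketch_step, and the hyperparameters h, alpha, and sigma are illustrative assumptions for this sketch, not the exact NAG-GS scheme analyzed in the paper.

```python
import numpy as np

def nag_gs_sketch_step(x, v, grad_fn, h=0.1, alpha=1.0, sigma=0.0, rng=None):
    """One step of a semi-implicit (Gauss-Seidel) discretization of a
    Nesterov-like SDE. Hypothetical sketch: the assumed continuous model is
        dx = (v - x) dt
        dv = -alpha (v - x) dt - grad f(x) dt + sigma dW,
    and the Gauss-Seidel splitting updates x implicitly first, then lets
    the v-update use the freshly computed x (not the old one).
    This is an illustration, not the paper's exact NAG-GS update."""
    rng = rng or np.random.default_rng()
    # Implicit update in x: solve x_new = x + h * (v - x_new)
    x_new = (x + h * v) / (1.0 + h)
    # Euler-Maruyama noise term for the stochastic part of the v-equation
    noise = sigma * np.sqrt(h) * rng.standard_normal(size=np.shape(v))
    # Semi-implicit v-update reusing x_new (Gauss-Seidel ordering):
    # v_new = v + h * (-alpha * (v_new - x_new) - grad f(x_new)) + noise
    v_new = (v + h * alpha * x_new - h * grad_fn(x_new) + noise) / (1.0 + h * alpha)
    return x_new, v_new

# Illustrative use on a quadratic f(x) = 0.5 * x^T A x, the setting in which
# the abstract says convergence and stability are analyzed (toy example).
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
x, v = np.ones(2), np.zeros(2)
for _ in range(200):
    x, v = nag_gs_sketch_step(x, v, grad, h=0.1, alpha=1.0, sigma=0.01)
```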