While momentum-based methods, in conjunction with stochastic gradient descent (SGD), are widely used when training machine learning models, there is little theoretical understanding of the generalization error of such methods. In this work, we first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded. Then, for smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM), and show that it admits an upper bound on the generalization error. Thus, our results show that machine learning models can be trained for multiple epochs of SGDEM with a guarantee of generalization. Finally, for the special case of strongly convex loss functions, we find a range of momentum parameters for which multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes. Extending our results on generalization, we also develop an upper bound on the expected true risk, in terms of the number of training steps, the size of the training set, and the momentum parameter. Our experimental evaluations verify the consistency between the numerical results and our theoretical bounds. SGDEM improves on the generalization error of SGDM when training ResNet-18 on ImageNet in practical distributed settings.
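To make the update rules concrete, the following is a minimal sketch; the step-size notation and the exact early-momentum schedule below are our illustrative assumptions, not the paper's precise definitions. Standard heavy-ball SGDM iterates
\[
  w_{t+1} = w_t - \alpha \,\nabla f(w_t; z_{i_t}) + \mu\,(w_t - w_{t-1}),
\]
with step size $\alpha$, momentum parameter $\mu$, and sampled training example $z_{i_t}$. SGDEM can be viewed as replacing $\mu$ with a time-varying $\mu_t$ that is active only during an early phase of training, e.g.,
\[
  w_{t+1} = w_t - \alpha \,\nabla f(w_t; z_{i_t}) + \mu_t\,(w_t - w_{t-1}),
  \qquad
  \mu_t = \begin{cases} \mu, & t \le t_0, \\ 0, & t > t_0, \end{cases}
\]
so that standard SGDM is recovered as the special case in which the early phase $t_0$ spans all of training.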