We provide sharp path-dependent generalization and excess error guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is a new technique for bounding the generalization error of deterministic symmetric algorithms, which shows that average output stability together with a bounded expected gradient of the loss at termination implies generalization. This key result shows that small generalization error occurs at stationary points, and allows us to bypass the Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous work. For nonconvex, Polyak-Lojasiewicz (PL), convex, and strongly convex losses, we show the explicit dependence of the generalization error on the accumulated path-dependent optimization error, the terminal optimization error, the number of samples, and the number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, under a properly chosen decreasing step size. Further, if the loss is nonconvex but the objective is PL, we derive quadratically vanishing bounds on the generalization error and the corresponding excess risk, for a choice of a large constant step size. For (resp. strongly) convex smooth losses, we prove that full-batch GD also generalizes for large constant step sizes, and achieves (resp. quadratically) small excess risk while training fast. In all cases, our full-batch GD generalization error and excess risk bounds are strictly tighter than existing bounds for (stochastic) GD when the loss is smooth (but possibly non-Lipschitz).
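For concreteness, a minimal sketch of the setting the abstract refers to, using standard notation (the symbols $\ell$, $R_S$, $R$, $\eta_t$, the sample $S = \{z_1,\dots,z_n\}$, and the data distribution $\mathcal{D}$ are assumptions of this sketch, not fixed by the abstract): full-batch GD iterates on the empirical risk, and the generalization error is the gap between the population and empirical risks at the output.
\[
R_S(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w; z_i), \qquad
R(w) = \mathbb{E}_{z \sim \mathcal{D}}\big[\ell(w; z)\big],
\]
\[
w_{t+1} = w_t - \eta_t \nabla R_S(w_t), \qquad
\varepsilon_{\mathrm{gen}}(w_T) = \mathbb{E}\big[R(w_T) - R_S(w_T)\big],
\]
with excess risk $\mathbb{E}[R(w_T)] - \min_{w} R(w)$. The step-size choices mentioned above (decreasing $\eta_t$ in the nonconvex case, large constant $\eta_t$ in the PL and convex cases) refer to schedules for the update rule displayed here.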