Machine learning models trained with \emph{stochastic} gradient descent (SGD) can generalize better than those trained with deterministic gradient descent (GD). In this work, we study SGD's impact on generalization through the lens of the statistical bootstrap: SGD uses gradient variability under batch sampling as a proxy for solution variability under the randomness of the data-collection process. We substantiate this claim with empirical results and theoretical analysis. In idealized experiments on empirical risk minimization, we show that SGD is drawn to parameter choices that are robust under resampling, and thus avoids spurious solutions even when they lie in wider and deeper minima of the training loss. We prove rigorously that, by implicitly regularizing the trace of the gradient covariance matrix, SGD controls the algorithmic variability. This regularization leads to solutions that are less sensitive to sampling noise, thereby improving generalization. Numerical experiments on neural network training show that explicitly incorporating an estimate of the algorithmic variability as a regularizer improves test performance, supporting our claim that bootstrap estimation underpins SGD's generalization advantages.
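As a hedged illustration of the explicit regularizer mentioned above (the per-example losses $\ell_i$, the regularization weight $\lambda$, and the particular covariance estimator are notational assumptions for exposition, not details taken from the abstract), the regularized objective can be sketched as
\begin{equation*}
\min_{\theta}\; L(\theta) + \lambda\, \operatorname{tr}\!\big(\widehat{\Sigma}(\theta)\big),
\qquad
\widehat{\Sigma}(\theta) = \frac{1}{n}\sum_{i=1}^{n}
\big(\nabla \ell_i(\theta) - \nabla L(\theta)\big)
\big(\nabla \ell_i(\theta) - \nabla L(\theta)\big)^{\top},
\end{equation*}
where $L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(\theta)$ is the empirical risk and $\operatorname{tr}\widehat{\Sigma}(\theta)$, the trace of the empirical gradient covariance, serves as the estimate of algorithmic variability that SGD is argued to regularize implicitly.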