We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD) in the fundamental stochastic convex optimization model. While for SGD it is well-known that $O(1/\epsilon^2)$ iterations suffice for obtaining a solution with $\epsilon$ excess expected risk, we show that with the same number of steps GD may overfit and emit a solution with $\Omega(1)$ generalization error. Moreover, we show that in fact $\Omega(1/\epsilon^4)$ iterations are necessary for GD to match the generalization performance of SGD, which is also tight due to recent work by Bassily et al. (2020). We further discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
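To make the contrast between the two algorithms concrete, the following minimal Python sketch implements one-pass SGD (one fresh sample per step) next to full-batch GD on the empirical risk. The least-squares instance, the step size `eta`, and all variable names are illustrative assumptions; the sketch does not reproduce the paper's lower-bound construction, which requires a carefully designed high-dimensional convex loss, so no overfitting gap should be expected on this easy problem.

```python
import numpy as np

# Illustrative sketch only: contrasts one-pass SGD with full-batch GD on a
# simple stochastic convex objective (least squares). This is NOT the paper's
# lower-bound construction; it just shows the two update rules side by side.

rng = np.random.default_rng(0)

d, n = 20, 500                       # dimension, sample size (assumed values)
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

def grad(w, Xb, yb):
    """Gradient of the squared loss 0.5 * ||Xb @ w - yb||^2 / len(yb)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

eta, T = 0.01, n                     # step size; T = n steps (one data pass)

# One-pass SGD: each step uses a single fresh sample, so after T = O(1/eps^2)
# steps the averaged iterate attains eps excess *population* risk.
w_sgd = np.zeros(d)
avg_sgd = np.zeros(d)
for t in range(T):
    w_sgd -= eta * grad(w_sgd, X[t:t+1], y[t:t+1])
    avg_sgd += w_sgd / T

# Full-batch GD: every step touches the whole sample and drives down the
# *empirical* risk; the paper shows its population risk can remain Omega(1)
# unless it runs for Omega(1/eps^4) steps.
w_gd = np.zeros(d)
for t in range(T):
    w_gd -= eta * grad(w_gd, X, y)

# Population (test) risk estimated on fresh data.
X_test = rng.normal(size=(10 * n, d))
y_test = X_test @ w_star + 0.1 * rng.normal(size=10 * n)
risk = lambda w: 0.5 * np.mean((X_test @ w - y_test) ** 2)
print(f"SGD (averaged iterate) test risk: {risk(avg_sgd):.4f}")
print(f"GD                     test risk: {risk(w_gd):.4f}")
```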