We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data. We consider the fundamental stochastic convex optimization framework, where (one-pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt{n})$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both an empirical risk and a generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or, for that matter, by any other currently known generalization bound technique (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.
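The distinction between the two sampling schemes discussed above can be made concrete with a minimal sketch. The toy objective below (a one-dimensional least-squares loss) and all function names are illustrative assumptions, not the paper's construction; the sketch only shows how one-pass without-replacement SGD (a random permutation of the data) differs from with-replacement SGD (i.i.d. uniform draws), and that classical analyses bound the risk of the averaged iterate.

```python
import random

def sgd_one_pass(data, lr=0.1, with_replacement=False):
    """One pass of SGD on the toy convex loss f(w; z) = 0.5 * (w - z)**2.

    Without replacement: visit each of the n samples exactly once,
    in a uniformly random order (a random permutation).
    With replacement: draw n i.i.d. uniform samples from the data.
    The gradient of f at w for a sample z is (w - z).
    """
    n = len(data)
    if with_replacement:
        order = [random.randrange(n) for _ in range(n)]  # i.i.d. draws
    else:
        order = random.sample(range(n), n)  # random permutation
    w = 0.0
    iterates = [w]
    for i in order:
        w -= lr * (w - data[i])  # stochastic gradient step
        iterates.append(w)
    # Classical O(1/sqrt(n)) guarantees concern the averaged iterate.
    return sum(iterates) / len(iterates)
```

With all samples equal (say `data = [1.0] * 100`), both schemes drive the averaged iterate toward the common minimizer; the paper's point is that on certain carefully constructed instances, the without-replacement version can nonetheless underfit the training set while still generalizing.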