Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the method of choice for learning with large over-parameterized models. A popular theory for explaining why SGD works well in practice is that the algorithm has an implicit regularization that biases its output towards a good solution. Perhaps the theoretically best-understood learning setting for SGD is that of Stochastic Convex Optimization (SCO), where it is well known that SGD learns at a rate of $O(1/\sqrt{n})$, where $n$ is the number of samples. In this paper, we consider the problem of SCO and explore the role of implicit regularization, batch size, and multiple epochs for SGD. Our main contributions are threefold: (a) We show that for any regularizer, there is an SCO problem for which Regularized Empirical Risk Minimization (RERM) fails to learn. This automatically rules out any implicit-regularization-based explanation for the success of SGD. (b) We provide a separation between SGD and learning via Gradient Descent on the empirical loss (GD) in terms of sample complexity. We show that there is an SCO problem such that GD, with any step size and number of iterations, can only learn at a suboptimal rate of at least $\widetilde{\Omega}(1/n^{5/12})$. (c) We present a multi-epoch variant of SGD commonly used in practice. We prove that this algorithm is at least as good as single-pass SGD in the worst case. However, for certain SCO problems, taking multiple passes over the dataset can significantly outperform single-pass SGD. We extend our results to the general learning setting by showing a problem which is learnable for any data distribution, and for which SGD is strictly better than RERM for any regularization function. We conclude by discussing the implications of our results for deep learning, and show a separation between SGD and ERM for two-layer diagonal neural networks.
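To make the two algorithms being compared concrete, the following is a minimal sketch (not from the paper) of single-pass SGD versus full-batch GD on the empirical loss, on an easy toy SCO instance: minimizing the population risk $F(w) = \mathbb{E}_x[(w - x)^2]/2$ for $x \sim \mathcal{N}(\mu, 1)$, whose minimizer is $\mu$. The problem instance, step sizes, and iteration counts here are illustrative assumptions; the paper's separations arise on carefully constructed hard instances, not on problems this benign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SCO instance (illustrative): minimize F(w) = E[(w - x)^2] / 2
# over samples x ~ N(mu, 1); the population minimizer is w* = mu.
mu, n = 3.0, 1000
samples = rng.normal(mu, 1.0, size=n)

def single_pass_sgd(data, lr=0.1):
    """Single-pass SGD: each sample is used exactly once."""
    w = 0.0
    for x in data:
        w -= lr * (w - x)  # stochastic gradient of (w - x)^2 / 2
    return w

def gd_on_empirical_loss(data, lr=0.1, iters=200):
    """Full-batch gradient descent on the empirical risk."""
    w = 0.0
    for _ in range(iters):
        w -= lr * float(np.mean(w - data))  # gradient of empirical risk
    return w

w_sgd = single_pass_sgd(samples)
w_gd = gd_on_empirical_loss(samples)
# On this easy instance both land near the population minimizer mu;
# GD converges to the empirical mean, SGD's final iterate is noisier.
print(abs(w_sgd - mu), abs(w_gd - mu))
```

Note that GD here queries all $n$ samples per step while single-pass SGD queries one, which is why the paper measures the separation in sample complexity rather than per-step cost.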