Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which have been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both the underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than are provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally tuned ridge regression requires quadratically more samples than SGD in order to achieve the same generalization performance. Taken together, our results show that, up to logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems and, in fact, can be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.
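The two estimators being compared can be illustrated with a minimal NumPy sketch. It is not the experimental setup of this work: the Gaussian design, the spectrum λ_i = i⁻², the noise level, the stepsize, the ridge grid, averaging the iterates before each update, and the helper names `make_instance`, `averaged_sgd`, `ridge`, and `excess_risk` are all assumptions chosen for illustration.

```python
import numpy as np


def make_instance(n, d, noise_std, rng):
    """Draw a synthetic least squares instance (all choices here are assumptions)."""
    eigvals = 1.0 / np.arange(1, d + 1) ** 2      # hypothetical spectrum lambda_i = i^-2
    w_star = rng.standard_normal(d)               # hypothetical ground-truth weights
    X = rng.standard_normal((n, d)) * np.sqrt(eigvals)  # rows x ~ N(0, diag(eigvals))
    y = X @ w_star + noise_std * rng.standard_normal(n)
    return X, y, w_star, eigvals


def averaged_sgd(X, y, stepsize):
    """One pass of unregularized constant-stepsize SGD started at zero.

    Returns the average of the iterates taken before each update.
    """
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for i in range(n):
        w_sum += w                                # accumulate the current iterate
        residual = X[i] @ w - y[i]
        w = w - stepsize * residual * X[i]        # stochastic gradient step on one sample
    return w_sum / n


def ridge(X, y, lam):
    """Ridge estimator in dual form: X^T (X X^T + lam I)^{-1} y, for lam > 0."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)


def excess_risk(w, w_star, eigvals):
    """Excess risk (w - w*)^T H (w - w*) for a diagonal covariance H = diag(eigvals)."""
    return float(np.sum(eigvals * (w - w_star) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y, w_star, eigvals = make_instance(n=200, d=1000, noise_std=0.5, rng=rng)

    sgd_risk = excess_risk(averaged_sgd(X, y, stepsize=0.2), w_star, eigvals)
    print(f"averaged SGD excess risk: {sgd_risk:.4f}")

    # Sweep the ridge parameter over an arbitrary grid and report each risk.
    for lam in [0.01, 0.1, 1.0, 10.0]:
        ridge_risk = excess_risk(ridge(X, y, lam), w_star, eigvals)
        print(f"ridge (lam={lam:g}) excess risk: {ridge_risk:.4f}")
```

Running the script prints the excess risk of one averaged SGD pass next to that of ridge regression for each value on the grid. A single random instance only shows how such a comparison is set up; it does not verify the sample-complexity statements, which concern how many samples each method needs across a whole problem class.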