In this paper, we study the implicit bias of gradient descent for sparse regression. We extend results on regression with quadratic parametrization, which amounts to depth-2 diagonal linear networks, to more general depth-N networks, under more realistic settings of noise and correlated designs. We show that early stopping is crucial for gradient descent to converge to a sparse model, a phenomenon that we call implicit sparse regularization. This result is in sharp contrast to known results for noiseless and uncorrelated-design cases. We characterize the impact of depth and early stopping and show that for a general depth parameter N, gradient descent with early stopping achieves minimax optimal sparse recovery with sufficiently small initialization and step size. In particular, we show that increasing depth enlarges the scale of working initialization and the early-stopping window so that this implicit sparse regularization effect is more likely to take place.
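To make the setup concrete, below is a minimal sketch of gradient descent on a depth-N diagonally parametrized least-squares objective with small initialization. It assumes the common parametrization w = u^{⊙N} − v^{⊙N} with u = v = α·1 at initialization; the exact objective, hyperparameter names, and stopping rule used in the paper may differ, so treat this as an illustrative sketch rather than the paper's implementation.

```python
import numpy as np

def gd_depthN_sparse_regression(X, y, N=3, alpha=1e-3, lr=1e-2, max_iters=20000):
    """Gradient descent on f(u, v) = (1/2n) * ||y - X (u**N - v**N)||^2,
    started from the small initialization u = v = alpha * ones(d).
    Returns the trajectory of w_t = u_t**N - v_t**N so that an
    early-stopping time can be selected afterwards (e.g., by
    held-out validation error)."""
    n, d = X.shape
    u = np.full(d, alpha)
    v = np.full(d, alpha)
    trajectory = []
    for _ in range(max_iters):
        w = u**N - v**N
        residual = X @ w - y                 # shape (n,)
        grad_w = X.T @ residual / n          # gradient of f w.r.t. w
        # chain rule through the elementwise (Hadamard) powers
        grad_u = N * u**(N - 1) * grad_w
        grad_v = -N * v**(N - 1) * grad_w
        u -= lr * grad_u
        v -= lr * grad_v
        trajectory.append(u**N - v**N)
    return trajectory
```

In this sketch, the early-stopping iterate would be chosen from the returned trajectory, for instance by monitoring error on a held-out set; the abstract's claim is that with sufficiently small α and step size, iterates inside the early-stopping window are sparse and minimax optimal, and that larger N widens both the admissible initialization scale and this window.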