In deep learning it is common to overparameterize neural networks, that is, to use more parameters than training samples. Quite surprisingly, training the neural network via (stochastic) gradient descent leads to models that generalize very well, while classical statistics would suggest overfitting. In order to gain understanding of this implicit bias phenomenon, we study the special case of sparse recovery (compressed sensing), which is of interest in its own right. More precisely, in order to reconstruct a vector from underdetermined linear measurements, we introduce a corresponding overparameterized square loss functional in which the vector to be reconstructed is deeply factorized into several vectors. We show that, if there exists an exact solution, vanilla gradient flow for the overparameterized loss functional converges to a good approximation of the solution of minimal $\ell_1$-norm. The latter is well known to promote sparse solutions. As a by-product, our results significantly improve the sample complexity for compressed sensing via gradient flow/descent on overparameterized models derived in previous works. The theory accurately predicts the recovery rate in numerical experiments. Our proof relies on analyzing a certain Bregman divergence of the flow. This bypasses the obstacles caused by non-convexity and should be of independent interest.
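For concreteness, the following display sketches one standard way such an overparameterized functional is set up in this line of work; the depth-$N$ Hadamard-product factorization shown here is an assumption made for illustration and need not match the paper's exact parameterization.
\begin{align*}
  &\text{measurements: } y = A x^\ast \in \mathbb{R}^m, \qquad A \in \mathbb{R}^{m \times n},\ m < n,\\
  &\text{overparameterized loss: } \mathcal{L}(w_1,\dots,w_N) = \tfrac{1}{2}\,\big\| A\,(w_1 \odot \cdots \odot w_N) - y \big\|_2^2,\\
  &\text{gradient flow: } \dot{w}_k(t) = -\nabla_{w_k}\,\mathcal{L}\big(w_1(t),\dots,w_N(t)\big), \qquad k = 1,\dots,N,\\
  &\text{claimed limit: } w_1(t)\odot\cdots\odot w_N(t)\ \longrightarrow\ \hat{x} \approx \operatorname*{arg\,min}_{x:\,Ax=y} \|x\|_1 .
\end{align*}
Here $\odot$ denotes the entrywise (Hadamard) product, so the product $w_1 \odot \cdots \odot w_N$ plays the role of the vector to be reconstructed.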