Modern deep neural networks are highly over-parameterized compared to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? We investigate the hypothesis that deeper nets are implicitly biased to find lower-rank solutions and that these are the solutions that generalize well. We prove for the asymptotic case that the percent volume of low effective-rank solutions increases monotonically as linear neural networks are made deeper. We then show empirically that our claim holds true on finite-width models. We further empirically find that a similar result holds for non-linear networks: deeper non-linear networks learn a feature space whose kernel has a lower rank. We further demonstrate how linear over-parameterization of deep non-linear models can be used to induce low-rank bias, improving generalization performance without changing the effective model capacity. We evaluate on various model architectures and demonstrate that linearly over-parameterized models outperform existing baselines on image classification tasks, including ImageNet.
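To make the two central notions concrete, below is a minimal PyTorch sketch, not the authors' implementation: it shows (1) linear over-parameterization, i.e. replacing a single linear layer with a stack of linear maps with no nonlinearity in between, which leaves the set of representable functions unchanged while deepening the linear factorization, and (2) one standard way to measure effective rank, via the entropy of normalized singular values. Function names (`linearly_overparameterize`, `effective_rank`), layer sizes, and the chosen depth are illustrative assumptions.

```python
import torch
import torch.nn as nn


def linearly_overparameterize(layer: nn.Linear, depth: int = 3) -> nn.Sequential:
    """Replace one nn.Linear with `depth` stacked linear maps (no activation
    between them), so the composition still represents a single linear map
    of the same input/output shape."""
    d_in, d_out = layer.in_features, layer.out_features
    layers = [nn.Linear(d_in, d_in, bias=False) for _ in range(depth - 1)]
    layers.append(nn.Linear(d_in, d_out, bias=True))
    return nn.Sequential(*layers)


def effective_rank(mat: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Effective rank of a matrix: exp of the Shannon entropy of its
    normalized singular values (an entropy-based rank measure)."""
    s = torch.linalg.svdvals(mat)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy)


# Usage sketch: over-parameterize a layer and inspect the effective rank of
# the composed weight matrix at initialization (biases ignored for this check).
base = nn.Linear(128, 64)
over = linearly_overparameterize(base, depth=4)
with torch.no_grad():
    W = torch.eye(128)
    for lin in over:
        W = lin.weight @ W  # compose the stacked linear maps into one matrix
print("effective rank of composed weight:", effective_rank(W).item())
```

Because no nonlinearity separates the stacked layers, the composed product can be collapsed back into a single matrix at inference time, so the effective model capacity is unchanged; only the parameterization (and hence the implicit bias of gradient-based training) differs.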