Modern deep neural networks are highly over-parameterized relative to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit their training data? In this work, we make a series of empirical observations that investigate the hypothesis that deeper networks are inductively biased to find solutions with lower-rank embeddings. We conjecture that this bias exists because the volume of functions that map to low-rank embeddings increases with depth. We show empirically that our claim holds true for finite-width linear and non-linear models, and that these are the solutions that generalize well. We then show that the low-rank simplicity bias persists even after training with a wide variety of commonly used optimizers, and we find this phenomenon to be resilient to initialization, hyper-parameters, and learning methods. We further demonstrate how linear over-parameterization of deep non-linear models can be used to induce a low-rank bias, improving generalization performance without changing the effective model capacity. Practically, we show that simply linearly over-parameterizing standard models at training time can improve performance on image classification tasks, including ImageNet.
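To make the idea of linear over-parameterization concrete, the sketch below (an illustration of the general technique, not the authors' released code; the class name `OverparamLinear` and its parameters are hypothetical) replaces a single linear layer with a product of linear factors with no nonlinearity in between. The function class is unchanged, since the factors can be folded back into one matrix, but the factored parameterization is the kind of linear over-parameterization described above.

```python
# Minimal sketch, assuming a PyTorch training setup: replace nn.Linear(in, out)
# with a product of linear factors W_k ... W_1 of the same overall shape.
import torch
import torch.nn as nn


class OverparamLinear(nn.Module):
    """Computes y = W_k ... W_2 W_1 x (+ b): same capacity as a single Linear."""

    def __init__(self, in_features, out_features, depth=3, width=None):
        super().__init__()
        width = width or max(in_features, out_features)
        dims = [in_features] + [width] * (depth - 1) + [out_features]
        # Only the final factor carries a bias; intermediates are pure matrices.
        self.factors = nn.Sequential(*[
            nn.Linear(dims[i], dims[i + 1], bias=(i == depth - 1))
            for i in range(depth)
        ])

    def forward(self, x):
        return self.factors(x)

    @torch.no_grad()
    def collapse(self):
        """Fold the trained factors into a single Linear layer for inference."""
        weight, bias = None, None
        for layer in self.factors:
            weight = layer.weight if weight is None else layer.weight @ weight
            if layer.bias is not None:
                bias = layer.bias
        merged = nn.Linear(weight.shape[1], weight.shape[0], bias=bias is not None)
        merged.weight.copy_(weight)
        if bias is not None:
            merged.bias.copy_(bias)
        return merged
```

At training time the factored layer is used in place of the original one; after training, `collapse()` recovers a single layer of the original shape, so inference cost and effective model capacity are unchanged.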