Despite the empirical advances of deep learning across a variety of learning tasks, our theoretical understanding of its success remains very limited. One of the key challenges is the overparametrized nature of modern models, which enables complete overfitting of the data even if the labels are randomized, i.e. networks can completely \textit{memorize} all given patterns. While such a memorization capacity seems worrisome, in this work we show that under training protocols that include \textit{data augmentation}, neural networks learn to memorize entirely random labels in a benign way, i.e. they learn embeddings that lead to highly non-trivial performance under nearest neighbour probing. We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the tasks of memorization and feature learning across different layers. As a result, only the very last layers are used for memorization, while preceding layers encode performant features which remain largely unaffected by the label noise. We explore the intricate role of the augmentations used for training and identify a memorization-generalization trade-off in terms of their diversity, marking a clear distinction from all previous works. Finally, we give a first explanation for the emergence of benign memorization by showing that \textit{malign} memorization under data augmentation is infeasible due to the model's insufficient capacity for the increased sample size. As a consequence, the network is forced to leverage the correlated nature of the augmentations and as a result learns meaningful features. To complete the picture, a better theory of feature learning in deep neural networks is required to fully understand the origins of this phenomenon.
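To make the evaluation protocol concrete, the following is a minimal sketch of nearest neighbour probing, not the exact pipeline used in the paper: embeddings are extracted from a frozen intermediate layer of a network trained on randomized labels, a 1-NN classifier is fit on the \textit{clean} training labels, and accuracy is measured on held-out data. The feature arrays below are random placeholders standing in for such extracted embeddings.
\begin{verbatim}
# Minimal sketch of nearest neighbour probing on frozen embeddings.
# The random arrays below are placeholders for features extracted from
# an intermediate layer of a network trained on randomized labels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Placeholder embeddings and *clean* labels (10 classes, 512-dim features).
train_feats = rng.normal(size=(5000, 512))
train_labels = rng.integers(0, 10, size=5000)
test_feats = rng.normal(size=(1000, 512))
test_labels = rng.integers(0, 10, size=1000)

# 1-nearest-neighbour probe: no learned classifier head is trained, so the
# accuracy directly reflects how well the embedding geometry separates classes.
probe = KNeighborsClassifier(n_neighbors=1)
probe.fit(train_feats, train_labels)
print(f"1-NN probing accuracy: {probe.score(test_feats, test_labels):.3f}")
\end{verbatim}
With meaningful embeddings, this probe recovers accuracy far above chance even though the network itself was fit to random labels; with the random placeholder features above, it stays at chance level.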