The Curious Case of Benign Memorization

Despite the empirical advances of deep learning across a variety of learning tasks, our theoretical understanding of its success is still very restricted. One of the key challenges is the overparametrized nature of modern models, enabling complete overfitting of the data even if the labels are randomized, i.e. networks can completely memorize all given patterns. While such a memorization capacity seems worrisome, in this work we show that under training protocols that include data augmentation, neural networks learn to memorize entirely random labels in a benign way, i.e. they learn embeddings that lead to highly non-trivial performance under nearest neighbour probing. We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers. As a result, only the very last layers are used for memorization, while preceding layers encode performant features which remain largely unaffected by the label noise. We explore the intricate role of the augmentations used for training and identify a memorization-generalization trade-off in terms of their diversity, marking a clear distinction to all previous works. Finally, we give a first explanation for the emergence of benign memorization by showing that malign memorization under data augmentation is infeasible due to the insufficient capacity of the model for the increased sample size. As a consequence, the network is forced to leverage the correlated nature of the augmentations and as a result learns meaningful features. To complete the picture, a better theory of feature learning in deep neural networks is required to fully understand the origins of this phenomenon.
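
The nearest neighbour probing mentioned in the abstract can be illustrated with a minimal sketch. It assumes embeddings have already been extracted from some layer of a trained network and that the true labels are available for evaluation. The function name `knn_probe_accuracy`, the choice of Euclidean distance, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def knn_probe_accuracy(train_emb, train_labels, test_emb, test_labels, k=1):
    """Hypothetical helper: k-nearest-neighbour probe on fixed embeddings.

    Each test embedding is classified by majority vote among its k closest
    training embeddings (Euclidean distance); the accuracy is returned.
    """
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    diffs = test_emb[:, None, :] - train_emb[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)

    # Indices of the k closest training points for every test point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    neighbour_labels = train_labels[nearest]  # shape (n_test, k)

    # Majority vote per test point; ties resolve to the smallest label.
    n_classes = int(train_labels.max()) + 1
    votes = np.stack(
        [np.bincount(row, minlength=n_classes) for row in neighbour_labels]
    )
    predictions = votes.argmax(axis=1)
    return float(np.mean(predictions == test_labels))


# Toy embeddings (assumed data): two well-separated clusters.
train_emb = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
train_labels = np.array([0, 0, 1, 1])
test_emb = np.array([[0.2, 0.5], [5.1, 5.4]])
test_labels = np.array([0, 1])

accuracy = knn_probe_accuracy(train_emb, train_labels, test_emb, test_labels, k=1)
print(accuracy)
```

In the setting the abstract describes, the embeddings would come from a network trained on random labels with data augmentation, while the probe is scored against the true labels. A probe accuracy well above chance then indicates that the features carry signal even though the training labels did not.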
