Imitating Deep Learning Dynamics via Locally Elastic Stochastic Differential Equations

Understanding the training dynamics of deep learning models is perhaps a necessary step toward demystifying the effectiveness of these models. In particular, how do data from different classes gradually become separable in their feature spaces when neural networks are trained using stochastic gradient descent? In this study, we model the evolution of features during deep learning training using a set of stochastic differential equations (SDEs), each of which corresponds to a training sample. As a crucial ingredient in our modeling strategy, each SDE contains a drift term that reflects the impact of backpropagation at an input on the features of all samples. Our main finding uncovers a sharp phase transition phenomenon regarding the intra-class impact: if the SDEs are locally elastic in the sense that the impact is more significant on samples from the same class as the input, the features of the training data become linearly separable, meaning vanishing training loss; otherwise, the features are not separable, regardless of how long the training time is. Moreover, in the presence of local elasticity, an analysis of our SDEs shows the emergence of a simple geometric structure, called neural collapse, in the features. Taken together, our results shed light on the decisive role of local elasticity in the training dynamics of neural networks. We corroborate our theoretical analysis with experiments on a synthesized dataset of geometric shapes and CIFAR-10.
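
To make the role of the drift term concrete, the following is a minimal toy sketch, not the model analyzed in the paper. It assumes a hypothetical setup in which an update at input j pushes the features of sample i along a fixed per-class direction, scaled by `alpha` when i and j share a class and by `beta` otherwise, and it integrates the resulting SDE with a plain Euler-Maruyama scheme. The names `simulate`, `class_mean_gap`, `alpha`, `beta`, and `sigma`, as well as the one-hot push directions, are illustrative assumptions.

```python
import numpy as np


def simulate(alpha, beta, n_per_class=20, n_classes=3,
             steps=2000, dt=0.01, sigma=0.1, seed=0):
    """Euler-Maruyama simulation of a toy feature SDE (illustrative only)."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_classes), n_per_class)
    n = labels.size

    # Assumption: an update at a class-k input pushes features along e_k.
    directions = np.eye(n_classes)

    # Impact of input j on sample i: alpha within a class, beta across classes.
    same_class = labels[:, None] == labels[None, :]
    impact = np.where(same_class, alpha, beta)

    # Drift averaged over a uniformly chosen input; constant in this toy.
    drift = impact @ directions[labels] / n

    x = np.zeros((n, n_classes))
    for _ in range(steps):
        noise = rng.standard_normal(x.shape)
        x = x + drift * dt + sigma * np.sqrt(dt) * noise
    return x, labels


def class_mean_gap(x, labels):
    """Smallest Euclidean distance between any two class means."""
    means = np.stack([x[labels == k].mean(axis=0) for k in np.unique(labels)])
    dists = np.linalg.norm(means[:, None] - means[None, :], axis=-1)
    return dists[np.triu_indices(len(means), k=1)].min()


if __name__ == "__main__":
    x_le, y = simulate(alpha=1.0, beta=0.2)    # locally elastic: alpha > beta
    x_flat, _ = simulate(alpha=1.0, beta=1.0)  # no local elasticity
    print("gap with local elasticity:   ", class_mean_gap(x_le, y))
    print("gap without local elasticity:", class_mean_gap(x_flat, y))
```

In this toy, every sample receives the same drift when `alpha == beta`, so the class means differ only through noise, whereas `alpha > beta` makes them drift apart over time. This mirrors the qualitative dichotomy described above but should not be read as a reproduction of the paper's equations or experiments.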
