Anomalous diffusion dynamics of learning in deep neural networks

Learning in deep neural networks (DNNs) is implemented through minimizing a highly non-convex loss function, typically by a stochastic gradient descent (SGD) method. This learning process can effectively find good wide minima without being trapped in poor local ones. We present a novel account of how such effective deep learning emerges through the interactions of the SGD and the geometrical structure of the loss landscape. Rather than being a normal diffusion process (i.e. Brownian motion) as often assumed, we find that the SGD exhibits rich, complex dynamics when navigating through the loss landscape; initially, the SGD exhibits anomalous superdiffusion, which attenuates gradually and changes to subdiffusion at long times when the solution is reached. Such learning dynamics happen ubiquitously in different DNNs such as ResNet and VGG-like networks and are insensitive to batch size and learning rate. The anomalous superdiffusion process during the initial learning phase indicates that the motion of SGD along the loss landscape possesses intermittent, big jumps; this non-equilibrium property enables the SGD to escape from sharp local minima. By adapting the methods developed for studying energy landscapes in complex physical systems, we find that such superdiffusive learning dynamics are due to the interactions of the SGD and the fractal-like structure of the loss landscape. We further develop a simple model to demonstrate the mechanistic role of the fractal loss landscape in enabling the SGD to effectively find global minima. Our results thus reveal the effectiveness of deep learning from a novel perspective and have implications for designing efficient deep neural networks.
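The distinction between normal diffusion, superdiffusion, and subdiffusion is commonly stated in terms of how the mean squared displacement (MSD) grows with time, MSD ~ t^alpha: alpha = 1 corresponds to Brownian motion, alpha > 1 to superdiffusion, and alpha < 1 to subdiffusion. The sketch below is only an illustration of that definition and is not the authors' analysis code. It assumes a trajectory stored as an array of shape (steps, dimensions), for example flattened network weights recorded after each SGD step, and the function names `msd` and `diffusion_exponent` are hypothetical. Synthetic trajectories stand in for real training runs.

```python
import numpy as np


def msd(trajectory, lags):
    """Time-averaged mean squared displacement for each lag.

    trajectory: array of shape (steps, dimensions), e.g. flattened
    weights saved after each update (an assumption of this sketch).
    """
    trajectory = np.asarray(trajectory, dtype=float)
    return np.array([
        np.mean(np.sum((trajectory[lag:] - trajectory[:-lag]) ** 2, axis=1))
        for lag in lags
    ])


def diffusion_exponent(trajectory, lags):
    """Estimate alpha in MSD ~ t**alpha as the log-log slope."""
    values = msd(trajectory, lags)
    slope, _intercept = np.polyfit(np.log(lags), np.log(values), 1)
    return slope


rng = np.random.default_rng(0)
lags = np.arange(1, 101)

# Brownian motion: cumulative sum of independent Gaussian steps.
brownian = np.cumsum(rng.normal(size=(20000, 10)), axis=0)

# Ballistic motion: constant velocity, the extreme superdiffusive case.
ballistic = np.outer(np.arange(20000), np.ones(10))

alpha_brownian = diffusion_exponent(brownian, lags)    # close to 1
alpha_ballistic = diffusion_exponent(ballistic, lags)  # equal to 2
print(alpha_brownian, alpha_ballistic)
```

Applied to a recorded weight trajectory, an exponent above 1 over short lags and below 1 over long lags would match the crossover from superdiffusion to subdiffusion described in the abstract. The range of lags used for the fit is a choice of this sketch, not a value taken from the paper.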
