We analyze the dynamics of finite-width effects in wide but finite feature-learning neural networks. Unlike many prior analyses, our results, while perturbative in width, are non-perturbative in the strength of feature learning. Starting from a dynamical mean field theory (DMFT) description of infinite-width deep neural network kernel and prediction dynamics, we characterize the $\mathcal{O}(1/\sqrt{\text{width}})$ fluctuations of the DMFT order parameters over random initializations of the network weights. In the lazy limit of network training, all kernels are random but static in time, and the prediction variance has a universal form. However, in the rich, feature-learning regime, the fluctuations of the kernels and predictions are dynamically coupled, with a variance that can be computed self-consistently. In two-layer networks, we show how feature learning can dynamically reduce the variance of the final neural tangent kernel (NTK) and of the final network predictions. We also show how initialization variance can slow down online learning in wide but finite networks. In deeper networks, kernel variance can accumulate dramatically through successive layers at large feature-learning strengths, but feature learning continues to improve the signal-to-noise ratio (SNR) of the feature kernels. In discrete time, we demonstrate that large-learning-rate phenomena such as edge-of-stability effects are well captured by infinite-width dynamics, and that initialization variance can decrease dynamically. For CNNs trained on CIFAR-10, we empirically find significant corrections to both the bias and the variance of network dynamics due to finite width.
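As a concrete illustration of the $\mathcal{O}(1/\sqrt{\text{width}})$ initialization fluctuations discussed above, the following is a minimal sketch (not the paper's code) that estimates the across-initialization variance of a single empirical NTK entry for a two-layer network at several widths $N$; if fluctuations scale as $1/\sqrt{N}$, then $N \cdot \mathrm{Var}[K]$ should be roughly width-independent. The parameterization, activation, widths, seed counts, and all helper names here (`init_params`, `ntk_entry`) are assumptions chosen for this example; it captures only the static, initialization-induced fluctuations, not the dynamically coupled rich-regime variance, which requires training and the DMFT self-consistency equations.

```python
# Minimal sketch (illustrative only, not the paper's code): measure the
# across-initialization variance of one empirical NTK entry, K(x1, x2),
# for a two-layer network at several widths N. If kernel fluctuations are
# O(1/sqrt(N)), then N * Var[K] should be roughly width-independent.
import jax
import jax.numpy as jnp

def init_params(key, d, N):
    kw, ka = jax.random.split(key)
    return {"W": jax.random.normal(kw, (N, d)),   # first-layer weights
            "a": jax.random.normal(ka, (N,))}     # readout weights

def f(params, x, N, d):
    # NTK-parameterized two-layer net: f(x) = a . tanh(W x / sqrt(d)) / sqrt(N)
    h = jnp.tanh(params["W"] @ x / jnp.sqrt(d))
    return jnp.dot(params["a"], h) / jnp.sqrt(N)

def ntk_entry(params, x1, x2, N, d):
    # Empirical NTK entry: K(x1, x2) = grad_theta f(x1) . grad_theta f(x2)
    g1 = jax.grad(lambda p: f(p, x1, N, d))(params)
    g2 = jax.grad(lambda p: f(p, x2, N, d))(params)
    return sum(jnp.vdot(u, v) for u, v in
               zip(jax.tree_util.tree_leaves(g1), jax.tree_util.tree_leaves(g2)))

d, n_seeds = 8, 200
kx, kseed = jax.random.split(jax.random.PRNGKey(0))
x1, x2 = jax.random.normal(kx, (2, d))
for N in (64, 256, 1024):
    keys = jax.random.split(jax.random.fold_in(kseed, N), n_seeds)
    ks = jnp.stack([ntk_entry(init_params(k, d, N), x1, x2, N, d) for k in keys])
    # The mean concentrates on the infinite-width NTK; N * Var[K] stays O(1).
    print(f"N={N:5d}  mean K={ks.mean():+.4f}  N*Var[K]={N * ks.var():.4f}")
```

In the lazy limit this initialization variance simply persists through training; the abstract's claim is that in the rich regime, feature learning can dynamically shrink it for the final NTK and predictions.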