In this note, we first derive a one-parameter family of hyperparameter scaling strategies that interpolates between the neural-tangent scaling and mean-field/maximal-update scaling. We then calculate the scalings of dynamical observables -- network outputs, neural tangent kernels, and differentials of neural tangent kernels -- for wide and deep neural networks. These calculations in turn reveal a proper way to scale depth with width such that resultant large-scale models maintain their representation-learning ability. Finally, we observe that various infinite-width limits examined in the literature correspond to the distinct corners of the interconnected web spanned by effective theories for finite-width neural networks, with their training dynamics ranging from being weakly-coupled to being strongly-coupled.
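As a rough illustration of what such a one-parameter interpolation can look like (this sketch, including the parameter name $s$ and the specific exponents, is an assumption for exposition and not necessarily the note's own parametrization), one can rescale the readout layer of a width-$n$ network together with the global learning rate:

\[
  f(x) \;=\; \frac{1}{n^{(1+s)/2}} \sum_{i=1}^{n} W^{(L+1)}_{i}\, \sigma\!\big(z^{(L)}_{i}(x)\big),
  \qquad
  \eta \;\propto\; n^{s},
  \qquad
  s \in [0,1].
\]

Here $s=0$ recovers the neural-tangent normalization $1/\sqrt{n}$ with an $O(1)$ learning rate, while $s=1$ recovers the mean-field/maximal-update normalization $1/n$ with a learning rate scaled up by a compensating factor of $n$.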