Yang (2020a) recently showed that the Neural Tangent Kernel (NTK) at initialization has an infinite-width limit for a large class of architectures, including modern staples such as ResNet and Transformers. However, their analysis does not apply to training. Here, we show that the same neural networks (in the so-called NTK parametrization), during training, follow kernel gradient descent dynamics in function space, where the kernel is the infinite-width NTK. This completes the proof of the *architectural universality* of NTK behavior. To achieve this result, we apply the Tensor Programs technique: write the entire SGD dynamics inside a Tensor Program and analyze it via the Master Theorem. To facilitate this proof, we develop a graphical notation for Tensor Programs.
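To make the function-space claim concrete, the kernel gradient descent dynamics can be written as follows (a standard formulation sketched here for illustration; the symbols $\eta$, $\mathcal{L}$, and $\mathring{K}$ are introduced for this sketch and are not defined in the abstract). For training pairs $(x_i, y_i)_{i=1}^{n}$, loss $\mathcal{L}$, and learning rate $\eta$, a full-batch gradient descent step moves the infinite-width network function $f_t$ as
$$
f_{t+1}(\xi) \;=\; f_t(\xi) \;-\; \eta \sum_{i=1}^{n} \mathring{K}(\xi, x_i)\, \partial_{f}\mathcal{L}\big(f_t(x_i),\, y_i\big),
$$
where $\mathring{K}$ denotes the infinite-width NTK at initialization, which remains fixed throughout training; for SGD, the sum runs over the current minibatch instead of the full training set.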