Neural network (NN) training and generalization in the infinite-width limit are well characterized by kernel methods with a neural tangent kernel (NTK) that is stationary in time. However, finite-width NNs consistently outperform the corresponding kernel methods, suggesting the importance of feature learning, which manifests as the time evolution of the NTK. Here, we analyze the phenomenon of kernel alignment of the NTK with the target functions during gradient descent. We first provide a mechanistic explanation for why alignment between task and kernel occurs in deep linear networks. We then show that this behavior arises more generally if one optimizes the feature map over time to accelerate learning while constraining how quickly the features evolve. Empirically, gradient descent undergoes a feature learning phase, during which the top eigenfunctions of the NTK quickly align with the target function and the loss decreases faster than a power law in time; it then enters a kernel gradient descent (KGD) phase, in which the alignment does not improve significantly and the training loss decreases as a power law. We show that feature evolution is faster and more dramatic in deeper networks. We also find that networks with multiple output nodes develop separate, specialized kernels for each output channel, a phenomenon we term kernel specialization. We show that this class-specific alignment does not occur in linear networks.
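For concreteness, the kernel-target alignment tracked here can be thought of as the normalized overlap between the NTK Gram matrix and the rank-one target kernel. The sketch below is a minimal illustration in plain NumPy, assuming the standard centered-alignment definition and a user-supplied empirical NTK matrix; it is not taken from the paper's code and the exact normalization used in the paper may differ.

```python
import numpy as np

def centered(K):
    # Center a Gram matrix: H K H with H = I - (1/n) * 1 1^T.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_target_alignment(K, y):
    # A(K, y y^T) = <K_c, Y_c>_F / (||K_c||_F ||Y_c||_F),
    # where Y = y y^T is the rank-one kernel induced by the targets.
    Kc = centered(K)
    Yc = centered(np.outer(y, y))
    return np.sum(Kc * Yc) / (np.linalg.norm(Kc) * np.linalg.norm(Yc))

# Toy usage with a stand-in PSD matrix in place of an empirical NTK.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
K = X @ X.T                       # stand-in for the NTK Gram matrix on the training set
y = rng.choice([-1.0, 1.0], 50)   # stand-in binary targets
print(kernel_target_alignment(K, y))
```

During the feature learning phase described above, this scalar would be expected to rise quickly as the top NTK eigenfunctions rotate toward the target, and to plateau once the KGD phase begins.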