Neural Tangent Kernel (NTK) theory is widely used to study the dynamics of infinitely-wide deep neural networks (DNNs) under gradient descent. But do results for infinitely-wide networks give us hints about the behavior of real finite-width ones? In this paper, we study empirically when NTK theory is valid in practice for fully-connected ReLU and sigmoid DNNs. We find that whether a network is in the NTK regime depends on the hyperparameters of random initialization and on the network's depth. In particular, NTK theory does not explain the behavior of sufficiently deep networks initialized so that their gradients explode as they propagate through the network's layers: in this case the kernel is random at initialization and changes significantly during training, contrary to NTK theory. In the case of vanishing gradients, on the other hand, DNNs are in the NTK regime but rapidly become untrainable as depth grows. We also describe a framework for studying generalization properties of DNNs, in particular the variance of the network's output function, by means of NTK theory, and discuss its limits.
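To make the kind of empirical check described above concrete, here is a minimal sketch (not the paper's code) of how one might test whether a finite-width fully-connected ReLU network stays in the NTK regime: compute the empirical kernel Theta(x, x') = <grad_theta f(x), grad_theta f(x')> at initialization and again after some gradient-descent steps, and measure how much it moved. The width, depth, initialization scale `sigma_w` (which controls whether gradients explode or vanish), learning rate, and toy data below are illustrative assumptions, not values from the paper.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes, sigma_w=1.5):
    """Gaussian init; sigma_w > sqrt(2) pushes ReLU nets toward exploding gradients."""
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        W = jax.random.normal(sub, (d_in, d_out)) * sigma_w / jnp.sqrt(d_in)
        params.append((W, jnp.zeros(d_out)))
    return params

def forward(params, x):
    """Fully-connected ReLU network with a scalar output per example."""
    h = x
    for W, b in params[:-1]:
        h = jax.nn.relu(h @ W + b)
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)

def empirical_ntk(params, X):
    """Theta_ij = <grad_theta f(x_i), grad_theta f(x_j)> via explicit Jacobians."""
    def single_grad(x):
        g = jax.grad(lambda p: forward(p, x[None, :])[0])(params)
        return jnp.concatenate([jnp.ravel(l) for l in jax.tree_util.tree_leaves(g)])
    J = jax.vmap(single_grad)(X)          # shape (n_examples, n_params)
    return J @ J.T

key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (16, 8))       # toy inputs (illustrative)
y = jnp.sin(X[:, 0])                      # toy regression targets
params = init_params(key, [8, 512, 512, 1])

theta0 = empirical_ntk(params, X)         # kernel at initialization

# A few full-batch gradient-descent steps on the squared loss.
loss = lambda p: jnp.mean((forward(p, X) - y) ** 2)
for _ in range(200):
    grads = jax.grad(loss)(params)
    params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)

theta1 = empirical_ntk(params, X)         # kernel after training
rel_change = jnp.linalg.norm(theta1 - theta0) / jnp.linalg.norm(theta0)
print(f"relative kernel change after training: {rel_change:.3f}")
```

In the NTK regime one would expect `rel_change` to shrink as the width grows; a large value for a deep network with exploding-gradient initialization would be consistent with the kernel movement the abstract describes.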