Recent work by Jacot et al. (2018) has shown that training a neural network of any kind with gradient descent in parameter space is strongly related to kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK). Lee et al. (2019) built on this result by establishing that the output of a neural network trained using gradient descent can be approximated by a linear model for wide networks. In parallel, a recent line of studies (Schoenholz et al. 2017; Hayou et al. 2019) has suggested that a special initialization, known as the Edge of Chaos, improves training. In this paper, we bridge the gap between these two concepts by quantifying the impact of the initialization and the activation function on the NTK when the network depth becomes large. In particular, we show that the performance of wide deep neural networks cannot be explained by the NTK regime and we provide experiments illustrating our theoretical results.
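For reference, a minimal sketch of the two objects the abstract refers to, using standard definitions; the notation ($f_\theta$ for the network, $\Theta$ for the kernel, $\theta_0$ for the initialization) is chosen here for illustration and is not taken from the paper body. The empirical NTK of a network $f_\theta$ evaluated at inputs $x$ and $x'$ is
\[
  \Theta_\theta(x, x') \;=\; \nabla_\theta f_\theta(x)^\top \, \nabla_\theta f_\theta(x'),
\]
and the linear model of Lee et al. (2019) is the first-order expansion around the initialization $\theta_0$,
\[
  f_\theta(x) \;\approx\; f_{\theta_0}(x) \;+\; \nabla_\theta f_{\theta_0}(x)^\top (\theta - \theta_0),
\]
an approximation that becomes accurate as the layer widths grow.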