We provide quantitative bounds measuring the $L^2$ difference in function space between the trajectory of a finite-width network trained on finitely many samples and the idealized kernel dynamics of infinite width and infinite data. An implication of the bounds is that the network is biased to learn the top eigenfunctions of the Neural Tangent Kernel not just on the training set but over the entire input space. This bias depends only on the model architecture and the input distribution, and thus not on the target function, which need not lie in the RKHS of the kernel. The result is valid for deep architectures with fully connected, convolutional, and residual layers. Furthermore, the width need not grow polynomially with the number of samples in order to obtain high-probability bounds up to a stopping time. The proof exploits the low-effective-rank property of the Fisher Information Matrix at initialization, which implies a low effective dimension of the model (far smaller than the number of parameters). We conclude that local capacity control via the low effective rank of the Fisher Information Matrix remains theoretically underexplored.
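As a minimal numerical sketch of the quantity the proof exploits, the snippet below estimates the empirical NTK Gram matrix of a small fully connected network at initialization and reports an effective rank of the (spectrally equivalent) Fisher Information Matrix via the eigenvalue participation ratio. The architecture, the NTK parameterization, the width, the sample count, and the participation-ratio definition of effective rank are illustrative assumptions, not the paper's construction.

```python
# Sketch (illustrative assumptions): empirical NTK Gram matrix of a two-layer
# ReLU network at initialization, and an effective-rank estimate for the FIM.
import jax
import jax.numpy as jnp

def init_params(key, d_in=10, width=256, d_out=1):
    k1, k2 = jax.random.split(key)
    # NTK parameterization: weights ~ N(0, 1); the forward pass rescales by 1/sqrt(fan_in).
    return {
        "W1": jax.random.normal(k1, (d_in, width)),
        "W2": jax.random.normal(k2, (width, d_out)),
    }

def forward(params, x):
    h = jax.nn.relu(x @ params["W1"] / jnp.sqrt(x.shape[-1]))
    return (h @ params["W2"] / jnp.sqrt(h.shape[-1])).squeeze(-1)

params = init_params(jax.random.PRNGKey(0))
X = jax.random.normal(jax.random.PRNGKey(1), (200, 10))   # n = 200 inputs

def flat_jacobian(x):
    # Gradient of the scalar output w.r.t. all parameters, flattened to one row.
    grads = jax.grad(lambda p: forward(p, x[None, :])[0])(params)
    return jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(grads)])

J = jax.vmap(flat_jacobian)(X)           # shape (n, num_params)
ntk_gram = J @ J.T                       # empirical NTK Gram matrix Theta(x_i, x_j)
eigvals = jnp.linalg.eigvalsh(ntk_gram)  # same nonzero spectrum as the FIM J^T J / n

# Effective rank via the participation ratio (sum lambda)^2 / sum lambda^2:
# a handful of dominant eigenvalues gives an effective rank far below num_params.
eff_rank = eigvals.sum() ** 2 / (eigvals ** 2).sum()
print(f"num_params = {J.shape[1]}, n = {J.shape[0]}, effective rank ~ {eff_rank:.1f}")
```

The participation ratio is scale-invariant, so the nonzero eigenvalues of the Gram matrix $JJ^\top$ and of the squared-loss FIM $\frac{1}{n}J^\top J$ yield the same value; the point of the sketch is only that this number is typically far smaller than the parameter count at initialization.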