The excellent real-world performance of deep neural networks has received increasing attention. Despite having the capacity to overfit significantly, such large models work better than smaller ones. This phenomenon is often referred to by practitioners as the scaling law. It is of fundamental interest to understand why the scaling law holds and how overfitting is avoided or controlled. One approach has been to study infinite-width limits of neural networks (e.g., Neural Tangent Kernels, Gaussian Processes); however, in practice these do not fully explain finite networks, as their infinite counterparts do not learn features. Furthermore, the empirical kernel of finite networks (i.e., the inner product of feature vectors) changes significantly during training, in contrast to infinite-width networks. In this work we derive an iterative linearised training method. We justify iterative linearisation as an interpolation between a finite analogue of the infinite-width regime, which does not learn features, and standard gradient descent training, which does. We present preliminary results in which iterative linearised training performs well, noting in particular how much feature learning is required to achieve comparable performance. We also provide novel insights into the training behaviour of neural networks.
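The sketch below illustrates the general idea of iteratively linearised training described above: the network is replaced by its first-order Taylor expansion around a reference point, trained by gradient descent on that linearised model, and periodically re-linearised around the current parameters. This is only a minimal illustration under assumed names (`model_fn`, `linearised_apply`, `relin_every`, a squared-error loss), not the paper's actual implementation; the re-linearisation frequency is what interpolates between the two regimes mentioned in the abstract.

```python
# Minimal sketch of iterative linearised training in JAX.
# Assumptions: `model_fn(params, x)` is a generic network apply function with
# pytree parameters; loss, learning rate, and schedule are illustrative only.
import jax
import jax.numpy as jnp

def linearised_apply(model_fn, params0, params, x):
    # First-order Taylor expansion of the network around params0:
    # f_lin(params) = f(params0) + J_f(params0) (params - params0)
    delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
    f0, jvp_out = jax.jvp(lambda p: model_fn(p, x), (params0,), (delta,))
    return f0 + jvp_out

def loss_fn(params, params0, model_fn, x, y):
    preds = linearised_apply(model_fn, params0, params, x)
    return jnp.mean((preds - y) ** 2)

def iterative_linearised_train(model_fn, params, data_iter,
                               lr=1e-2, relin_every=100, num_steps=1000):
    # Gradient descent on the linearised model, re-linearising (moving the
    # expansion point params0) every `relin_every` steps.
    # relin_every >= num_steps: fully linearised ("lazy") training, no feature learning.
    # relin_every == 1: recovers standard gradient descent on the original network.
    params0 = params
    grad_fn = jax.grad(loss_fn)
    for step in range(num_steps):
        x, y = next(data_iter)
        grads = grad_fn(params, params0, model_fn, x, y)
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
        if (step + 1) % relin_every == 0:
            params0 = params  # re-linearise around the current parameters
    return params
```

In this reading, the empirical kernel of the linearised model stays fixed between re-linearisations, so varying `relin_every` gives a direct handle on how much the kernel (and hence the learned features) is allowed to change during training.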