A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK), regime, in which the network is coupled with its first-order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al., 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions, including sparse polynomials. Recent works have thus aimed to identify settings where gradient-based algorithms provably generalize better than the NTK. One such example is the "QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai and Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials. In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari and Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify "good" directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term, something neither the NTK nor the QuadNTK can do on its own. Finally, we construct a regularizer that encourages the parameter vector to move in the "good" directions, and show that gradient descent on the regularized loss converges to a global minimizer that also has low test error. This yields an end-to-end convergence and generalization guarantee with a provable sample complexity improvement over both the NTK and the QuadNTK on their own.
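To make the first-order (NTK) and second-order (QuadNTK) Taylor terms concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions and not the construction used in the paper: the tanh activation, the 1/sqrt(m) output scaling, the fixed random-sign second layer, and all function names (`network`, `taylor_terms`) are hypothetical choices made here for the example. It expands a two-layer network around its initial first-layer weights and compares the true output after a small perturbation with the first- and second-order approximations.

```python
import math
import random


def sigma(z):
    """Smooth activation (assumed here: tanh), so second derivatives exist."""
    return math.tanh(z)


def d_sigma(z):
    """First derivative of tanh."""
    return 1.0 - math.tanh(z) ** 2


def d2_sigma(z):
    """Second derivative of tanh."""
    t = math.tanh(z)
    return -2.0 * t * (1.0 - t * t)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def network(W, a, x):
    """Two-layer network f(x; W) = (1/sqrt(m)) * sum_r a_r * sigma(w_r . x)."""
    m = len(W)
    return sum(a_r * sigma(dot(w_r, x)) for a_r, w_r in zip(a, W)) / math.sqrt(m)


def taylor_terms(W0, a, delta, x):
    """Return the (first-order, second-order) Taylor terms of f in the
    first-layer weights, expanded around W0 in the direction delta."""
    m = len(W0)
    lin, quad = 0.0, 0.0
    for a_r, w_r, d_r in zip(a, W0, delta):
        pre = dot(w_r, x)    # pre-activation at initialization
        step = dot(d_r, x)   # change in pre-activation caused by delta
        lin += a_r * d_sigma(pre) * step                # NTK-style term
        quad += 0.5 * a_r * d2_sigma(pre) * step ** 2   # QuadNTK-style term
    return lin / math.sqrt(m), quad / math.sqrt(m)


# Hypothetical problem sizes and a fixed seed for reproducibility.
rng = random.Random(0)
d, m = 5, 64
W0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
a = [rng.choice([-1.0, 1.0]) for _ in range(m)]

# Unit-norm input.
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
x_norm = math.sqrt(dot(x, x))
x = [xi / x_norm for xi in x]

# Small perturbation of the first-layer weights.
delta = [[0.05 * rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
W = [[w + dl for w, dl in zip(w_r, d_r)] for w_r, d_r in zip(W0, delta)]

f0 = network(W0, a, x)
f1 = network(W, a, x)
lin, quad = taylor_terms(W0, a, delta, x)

err_first = abs(f1 - (f0 + lin))           # error of the linearized model
err_second = abs(f1 - (f0 + lin + quad))   # error after adding the quadratic term

print("first-order error: ", err_first)
print("second-order error:", err_second)
```

The linear term scales linearly with the perturbation and the quadratic term scales with its square. This is why the quadratic term is negligible for very small movements (the lazy regime) and only becomes relevant once the parameters move far enough from initialization.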