We explore the ability of overparameterized shallow ReLU neural networks to learn Lipschitz, nondifferentiable, bounded functions with additive noise when trained by Gradient Descent (GD). To avoid the problem that, in the presence of noise, neural networks trained to nearly zero training error are inconsistent on this class, we focus on early-stopped GD, which allows us to show consistency and optimal rates. In particular, we explore this problem from the viewpoint of the Neural Tangent Kernel (NTK) approximation of a GD-trained finite-width neural network. We show that whenever some early stopping rule is guaranteed to give an optimal rate (of excess risk) on the Hilbert space of the kernel induced by the ReLU activation function, the same rule can be used to achieve a minimax optimal rate for learning on the considered class of Lipschitz functions by neural networks. We discuss several practically appealing data-free and data-dependent stopping rules that yield optimal rates.
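To make the training procedure concrete, the following is a minimal sketch (not taken from the paper) of full-batch GD on a shallow ReLU network with NTK-style initialization, stopped early via a simple holdout rule. The width, learning rate, step budget, and the particular holdout-based criterion are illustrative assumptions standing in for the data-dependent stopping rules discussed in the abstract; the analyzed rules themselves are specified in the body of the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def train_early_stopped_gd(X, y, X_val, y_val, width=1024, lr=0.1, max_steps=2000):
    """Full-batch GD on f(x) = (1/sqrt(m)) * a^T relu(W x), with GD stopped
    once the held-out squared error stops improving (illustrative rule only)."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((width, d))       # inner weights, NTK-style Gaussian init
    a = rng.choice([-1.0, 1.0], size=width)   # fixed random outer signs (common NTK setup)

    def predict(W_cur, Xq):
        return relu(Xq @ W_cur.T) @ a / np.sqrt(width)

    best_W, best_val = W.copy(), np.inf
    for t in range(max_steps):
        H = relu(X @ W.T)                          # hidden activations, shape (n, width)
        resid = H @ a / np.sqrt(width) - y         # training residuals
        # gradient of the empirical squared loss with respect to the inner weights W
        grad_W = ((resid[:, None] * (H > 0)) * a[None, :]).T @ X / (n * np.sqrt(width))
        W = W - lr * grad_W
        val_err = np.mean((predict(W, X_val) - y_val) ** 2)
        if val_err < best_val:
            best_val, best_W = val_err, W.copy()
        elif t > 50 and val_err > 1.05 * best_val:
            break                                  # stop: holdout error has degraded
    return best_W, a
```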