We explore the ability of overparameterized shallow ReLU neural networks to learn Lipschitz, non-differentiable, bounded functions under additive noise when trained by Gradient Descent (GD). To avoid the issue that, in the presence of noise, neural networks trained to nearly zero training error are inconsistent on this class, we focus on early-stopped GD, which allows us to show consistency and optimal rates. In particular, we study this problem from the viewpoint of the Neural Tangent Kernel (NTK) approximation of a GD-trained finite-width neural network. We show that whenever an early stopping rule is guaranteed to give an optimal rate (of excess risk) on the Hilbert space of the kernel induced by the ReLU activation function, the same rule can be used to achieve the minimax-optimal rate for learning with neural networks on the considered class of Lipschitz functions. We discuss several data-free and data-dependent, practically appealing stopping rules that yield optimal rates.
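The sketch below is a minimal, self-contained illustration of the training procedure the abstract refers to: full-batch GD on an overparameterized shallow ReLU network with NTK-style 1/sqrt(m) scaling, fitting a Lipschitz, non-differentiable target (|x|) observed with additive noise, and stopped early by a simple hold-out rule. The specific target, width, learning rate, and the hold-out stopping criterion are illustrative assumptions, not the paper's exact setup or its particular stopping rules.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact construction):
# early-stopped full-batch GD on a wide shallow ReLU network, NTK-style scaling.
import numpy as np

rng = np.random.default_rng(0)

# --- noisy data from a Lipschitz, non-differentiable, bounded target ---
def target(x):
    return np.abs(x)                       # 1-Lipschitz, non-differentiable at 0

n, noise_std = 200, 0.3
x = rng.uniform(-1.0, 1.0, size=(n, 1))
y = target(x) + noise_std * rng.standard_normal((n, 1))

# hold out half of the sample for a data-dependent stopping rule (assumption)
n_tr = n // 2
x_tr, y_tr, x_val, y_val = x[:n_tr], y[:n_tr], x[n_tr:], y[n_tr:]

# --- overparameterized shallow ReLU network, 1/sqrt(m) output scaling ---
m = 2000                                   # width much larger than n_tr
W = rng.standard_normal((m, 1))            # input weights
b = rng.standard_normal(m)                 # biases
a = rng.choice([-1.0, 1.0], size=m)        # fixed output weights; only W, b trained

def forward(x, W, b):
    h = np.maximum(x @ W.T + b, 0.0)       # hidden ReLU activations, shape (n, m)
    return h @ a[:, None] / np.sqrt(m)

def gd_step(x, y, W, b, lr):
    h = np.maximum(x @ W.T + b, 0.0)
    resid = h @ a[:, None] / np.sqrt(m) - y
    act = (h > 0).astype(float)            # ReLU derivative
    # gradient of 0.5 * mean squared error w.r.t. W and b
    gW = (act * (resid * a / np.sqrt(m))).T @ x / len(y)
    gb = (act * (resid * a / np.sqrt(m))).sum(axis=0) / len(y)
    return W - lr * gW, b - lr * gb

# --- GD with a simple hold-out early-stopping rule ---
lr, max_steps, patience = 1.0, 5000, 200
best_val, best_params, since_best = np.inf, (W, b), 0
for t in range(max_steps):
    W, b = gd_step(x_tr, y_tr, W, b, lr)
    val_err = np.mean((forward(x_val, W, b) - y_val) ** 2)
    if val_err < best_val:
        best_val, best_params, since_best = val_err, (W.copy(), b.copy()), 0
    else:
        since_best += 1
        if since_best > patience:          # stop once validation error stops improving
            break

W, b = best_params
x_test = np.linspace(-1, 1, 50)[:, None]
print("estimated excess risk:", np.mean((forward(x_test, W, b) - target(x_test)) ** 2))
```

Early stopping here plays the role of regularization: running GD to near-zero training error would interpolate the noise, whereas stopping when the held-out error stops improving keeps the estimator consistent on this function class.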