Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax-optimal rates established in the kernel setting. Existing results either yield suboptimal rates of $O(1/\sqrt{n})$ or focus on networks with smooth activation functions, incurring exponential dependence on the network depth $L$. In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK separable with margin $\gamma$, we prove an excess risk rate of $\widetilde{O}(L^4 (1 + \gamma L^2) / (n \gamma^2))$, which matches the optimal SVM-type rate $\widetilde{O}(1 / (n \gamma^2))$ up to depth-dependent factors. A key technical contribution is a novel control of activation patterns near a reference model, which enables a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.
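For concreteness, the display below sketches the margin assumption and the stated rate. The feature map $\phi_0$ and its normalization are our own notation for the usual NTK-separability condition and are assumptions for illustration, not taken verbatim from the paper; the rate itself is quoted from the abstract.

% Sketch under assumed notation: f(\theta_0, \cdot) is the network at initialization and
% \phi_0(x) = \nabla_\theta f(\theta_0, x) / \|\nabla_\theta f(\theta_0, x)\|_2 its normalized
% NTK feature; this formalization of "NTK separable with margin \gamma" is an assumption.
\begin{align*}
  &\textbf{Assumed margin condition:}\quad
    \exists\, v \text{ with } \|v\|_2 \le 1 \ \text{ such that } \
    y_i \big\langle v, \phi_0(x_i) \big\rangle \;\ge\; \gamma
    \quad \text{for all } i \in [n];\\[4pt]
  &\textbf{Stated excess risk:}\quad
    \widetilde{O}\!\left(\frac{L^4\,(1+\gamma L^2)}{n\,\gamma^2}\right)
    \;=\;
    \underbrace{\widetilde{O}\!\left(\frac{1}{n\,\gamma^2}\right)}_{\text{SVM-type optimal rate}}
    \;\times\;
    \underbrace{L^4\,(1+\gamma L^2)}_{\text{polynomial depth overhead}}.
\end{align*}

Read this way, the factor $L^4(1+\gamma L^2)$ is the entire gap between the proven rate and the kernel-optimal benchmark, and it grows only polynomially in the depth $L$.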