This work studies the training of one-hidden-layer overparameterized ReLU networks by gradient descent in the neural tangent kernel (NTK) regime, where, unlike in previous work, the networks' biases are trainable and are initialized to a constant rather than to zero. The first set of results characterizes the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification converges as fast as the original network. The contribution over prior work is twofold: not only are the biases updated by gradient descent in our setting, but a finer analysis also improves the width required to keep the network close to its NTK. Second, a generalization bound for the trained network is provided. A width-sparsity dependence is established, yielding a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analyses (up to logarithmic factors). As a by-product, when the biases are initialized to zero, the width requirement improves the previously known bound for the generalization of shallow networks. Lastly, since the generalization bound depends on the smallest eigenvalue of the limiting NTK and the bounds from previous work yield vacuous generalization, this work further studies that smallest eigenvalue. Surprisingly, although trainable biases are not shown to be necessary, they help identify a data-dependent region on which a much finer analysis of the NTK's smallest eigenvalue can be carried out; this leads to a much sharper lower bound than the previously known worst-case bound and, consequently, to a non-vacuous generalization bound.
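To make the setting concrete, a one-hidden-layer ReLU network in the NTK parameterization with constant bias initialization is commonly written as f(x) = (1/sqrt(m)) * sum_{r=1}^{m} a_r * ReLU(<w_r, x> + b_r), with every bias b_r set to a constant beta at initialization. The minimal sketch below is an illustration under these standard assumptions, not the paper's code: the helper names, the unit-norm inputs, and the choice to hold the output weights a_r fixed are ours. It forms the empirical NTK Gram matrix over weights and biases at initialization and reports its least eigenvalue, the quantity analyzed in the last part of the abstract.

import numpy as np

def init_params(m, d, beta, rng):
    # One-hidden-layer ReLU network in the NTK parameterization:
    # f(x) = (1/sqrt(m)) * sum_r a_r * relu(<w_r, x> + b_r),
    # with biases initialized to the constant beta (not zero).
    W = rng.standard_normal((m, d))        # hidden weights ~ N(0, 1)
    b = np.full(m, beta)                   # trainable biases, constant initialization
    a = rng.choice([-1.0, 1.0], size=m)    # output weights (held fixed here)
    return W, b, a

def ntk_gram(X, W, b, a):
    # Empirical NTK Gram matrix w.r.t. (W, b) at initialization.
    # For ReLU, df/dw_r(x) = (1/sqrt(m)) a_r 1{<w_r,x>+b_r>0} x and
    # df/db_r(x) = (1/sqrt(m)) a_r 1{<w_r,x>+b_r>0}, so
    # K(x, x') = (1/m) sum_r 1{.,x} 1{.,x'} (<x, x'> + 1).
    m = W.shape[0]
    act = (X @ W.T + b > 0).astype(float)  # n x m matrix of activation indicators
    S = act * a                            # a_r^2 = 1, so the signs cancel in S @ S.T
    return (S @ S.T) * (X @ X.T + 1.0) / m

rng = np.random.default_rng(0)
n, d, m, beta = 32, 5, 4096, 0.5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs, a common NTK assumption
W, b, a = init_params(m, d, beta, rng)
K = ntk_gram(X, W, b, a)
print("least eigenvalue of the empirical NTK:", np.linalg.eigvalsh(K).min())

At large width the matrix K concentrates around the limiting NTK, so its smallest eigenvalue gives a numerical sense of the quantity whose lower bound the paper sharpens.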