The remarkable ability of deep neural networks to perfectly fit training data when optimized by gradient-based algorithms is yet to be fully explained theoretically. Explanations in recent theoretical works rely on networks being wider by orders of magnitude than those used in practice. In this work, we take a step towards closing the gap between theory and practice. We show that a randomly initialized deep neural network with ReLU activation converges to a global minimum in a logarithmic number of gradient-descent iterations, under a considerably milder condition on its width. Our analysis is based on a novel technique of training a network with fixed activation patterns. We study the unique properties of this technique that enable the improved convergence, and show that the trained network can be transformed at any time into an equivalent ReLU network of reasonable size. We derive a tight finite-width Neural Tangent Kernel (NTK) equivalence, suggesting that neural networks trained with our technique generalize at least as well as their NTK, and that the equivalence can therefore be used to study generalization.
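The abstract does not spell out the construction, but as a rough illustration, the following is a minimal PyTorch sketch of one plausible reading of "training a network with fixed activation patterns": the binary ReLU gates are recorded at random initialization and frozen, and gradient descent then updates the weights under those fixed gates. All names, dimensions, and hyperparameters here are illustrative assumptions, not taken from the paper.

```python
import torch

# Sketch (assumption): freeze the ReLU activation pattern at initialization,
# then train the hidden weights with the gates held fixed. With fixed gates
# and fixed output weights, the model is linear in W for each input, so the
# squared loss below is convex in W.

torch.manual_seed(0)
n, d, m = 64, 10, 512                      # samples, input dim, hidden width (illustrative)
X = torch.randn(n, d)
y = torch.randn(n, 1)

W = torch.randn(m, d) / d ** 0.5           # hidden-layer weights (trainable)
v = torch.randn(1, m) / m ** 0.5           # output weights (kept fixed in this sketch)
W.requires_grad_(True)

with torch.no_grad():
    gates = (X @ W.t() > 0).float()        # activation pattern, frozen at initialization

opt = torch.optim.SGD([W], lr=1e-1)
for step in range(500):
    opt.zero_grad()
    pre = X @ W.t()                        # pre-activations with the current weights
    hidden = gates * pre                   # fixed gates replace ReLU's own data-dependent gating
    loss = ((hidden @ v.t() - y) ** 2).mean()
    loss.backward()
    opt.step()

print(f"final training loss: {loss.item():.4e}")
```

Under these assumptions, the gated model coincides with the ReLU network at initialization, and the convexity of the fixed-gate objective is one intuition for why such a scheme could admit a faster convergence analysis than training the ReLU network directly; the paper's actual technique and its conversion back to an equivalent ReLU network may differ from this sketch.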