Deep neural networks' remarkable ability to correctly fit training data when optimized by gradient-based algorithms is yet to be fully understood. Recent theoretical results explain convergence only for ReLU networks that are orders of magnitude wider than those used in practice. In this work, we take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time. We show that convergence to a global minimum is guaranteed for networks whose widths are quadratic in the sample size and linear in their depth, at a time logarithmic in both. Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size. This construction can be viewed as a novel technique to accelerate training, while its tight finite-width equivalence to the Neural Tangent Kernel (NTK) suggests it can be utilized to study generalization as well.
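To make the notion of "fixed activation patterns" concrete, the following minimal sketch contrasts a standard ReLU network with a surrogate whose gates are frozen binary masks rather than data-dependent thresholds. The function names, the per-input freezing scheme, and the layer sizes are illustrative assumptions only; the paper's actual surrogate construction and its transformation to an equivalent ReLU network of reasonable size are more involved.

```python
import numpy as np

# Illustrative sketch (not the paper's exact construction): a "surrogate"
# network with fixed activation patterns vs. a standard ReLU network.

def relu_forward(x, weights):
    """Standard ReLU network: the gate at each layer depends on the input."""
    h = x
    for W in weights:
        h = np.maximum(W @ h, 0.0)
    return h

def surrogate_forward(x, weights, masks):
    """Surrogate network: the gate at each layer is a frozen 0/1 mask,
    so the output is linear in each weight matrix given the masks."""
    h = x
    for W, D in zip(weights, masks):
        h = D * (W @ h)
    return h

rng = np.random.default_rng(0)
d, m, depth = 8, 32, 3
weights = [rng.standard_normal((m, d)) / np.sqrt(d)]
weights += [rng.standard_normal((m, m)) / np.sqrt(m) for _ in range(depth - 1)]

x = rng.standard_normal(d)

# Freeze the activation pattern induced by the current weights at this input.
masks, h = [], x
for W in weights:
    pre = W @ h
    masks.append((pre > 0).astype(pre.dtype))
    h = np.maximum(pre, 0.0)

# With masks read off from the ReLU gates themselves, the two networks agree
# exactly at this input.
assert np.allclose(relu_forward(x, weights), surrogate_forward(x, weights, masks))
```

Because the frozen masks remove the nonlinearity of the gates, the loss becomes much easier to analyze as a function of the weights; the abstract's claim is that such a surrogate can nevertheless be mapped back to a genuine ReLU network of reasonable size at any point during training.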