A candidate explanation of the good empirical performance of deep neural networks is the implicit regularization effect of first-order optimization methods. Inspired by this, we prove a convergence theorem for nonconvex composite optimization and apply it to a general learning problem that covers many machine learning applications, including supervised learning. We then present a deep multilayer perceptron model and prove that, when sufficiently wide, it $(i)$ leads to the convergence of gradient descent to a global optimum at a linear rate, $(ii)$ benefits from the implicit regularization effect of gradient descent, $(iii)$ is subject to novel bounds on the generalization error, $(iv)$ exhibits the lazy training phenomenon, and $(v)$ enjoys learning rate transfer across different widths. The corresponding coefficients, such as the convergence rate, improve as the width increases further, and depend on the even-order moments of the data-generating distribution up to an order that depends on the number of layers. The only non-mild assumption we make is that the smallest eigenvalue of the neural tangent kernel at initialization concentrates away from zero, which has been shown to hold for a number of less general models in contemporary works. We present empirical evidence supporting this assumption as well as our theoretical claims.
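As an illustrative aside (not part of the paper's experiments), the sketch below shows one way to probe the key assumption empirically: compute the Gram matrix of the empirical neural tangent kernel of a randomly initialized deep MLP and check that its smallest eigenvalue stays bounded away from zero as the width grows. The architecture, He-style initialization, unit-norm synthetic data, and the chosen widths are assumptions made here for illustration only; they are not taken from the paper.

```python
# Minimal sketch (assumptions: architecture, init scaling, synthetic data):
# estimate the smallest eigenvalue of the empirical NTK Gram matrix at
# initialization for a deep ReLU MLP, at a few widths.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_mlp(key, widths):
    """Gaussian (He-style) initialization for a fully connected network."""
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (d_out, d_in)) * jnp.sqrt(2.0 / d_in))
    return params

def mlp(params, x):
    """Scalar-output MLP with ReLU hidden layers."""
    h = x
    for W in params[:-1]:
        h = jax.nn.relu(W @ h)
    return (params[-1] @ h)[0]

def ntk_gram(params, X):
    """Empirical NTK Gram matrix K_ij = <grad_theta f(x_i), grad_theta f(x_j)>."""
    flat, unravel = ravel_pytree(params)
    f = lambda theta, x: mlp(unravel(theta), x)
    J = jax.vmap(lambda x: jax.grad(f)(flat, x))(X)  # Jacobian, shape (n, num_params)
    return J @ J.T

n, d = 32, 10
X = jax.random.normal(jax.random.PRNGKey(0), (n, d))
X = X / jnp.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs (assumption)

for width in (64, 256, 512):
    params = init_mlp(jax.random.PRNGKey(1), [d, width, width, 1])
    lam_min = float(jnp.linalg.eigvalsh(ntk_gram(params, X)).min())
    print(f"width={width:4d}  smallest NTK eigenvalue ~ {lam_min:.4f}")
```

Under this illustrative scaling one would expect the reported smallest eigenvalue to remain of comparable magnitude as the width grows, which is the qualitative behavior the concentration assumption describes; the paper's own empirical evidence and exact setup should be consulted for the precise statement.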