There has been a recent surge of interest in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks. Most previous works assume that the training data is provided a priori in a batch, while less attention has been paid to the important setting where the training data arrives in a stream. In this paper, we study the streaming data setup and show that with overparameterization and random initialization, the prediction error of two-layer neural networks under one-pass SGD converges in expectation. The convergence rate depends on the eigen-decomposition of the integral operator associated with the so-called neural tangent kernel (NTK). A key step of our analysis is to show that a random kernel function converges to the NTK with high probability, using the VC dimension and McDiarmid's inequality.
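For context, here is a minimal sketch of the standard two-layer NTK setup the abstract refers to; the exact parameterization, loss, and notation below are assumptions following the common NTK literature rather than details taken from this excerpt:
\[
f(x; W) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\, \sigma\bigl(w_r^\top x\bigr),
\qquad
W_{t+1} = W_t - \eta_t\, \nabla_W \tfrac{1}{2}\bigl(f(x_t; W_t) - y_t\bigr)^2,
\]
where each one-pass SGD step uses a fresh sample $(x_t, y_t)$ from the stream, the width is $m$, and the output weights $a_r \in \{\pm 1\}$ are held fixed. The associated NTK and its integral operator on $L^2(\rho)$, whose eigen-decomposition $\{(\lambda_i, \varphi_i)\}$ governs the stated convergence rate, are
\[
K(x, x') = \mathbb{E}_{w \sim \mathcal{N}(0, I)}\bigl[\sigma'(w^\top x)\, \sigma'(w^\top x')\, x^\top x'\bigr],
\qquad
(L_K g)(x) = \int K(x, x')\, g(x')\, d\rho(x').
\]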