Wide neural networks: from non-Gaussian random fields at initialization to the NTK geometry of training

Recent applications of artificial neural networks with over $n=10^{14}$ parameters make it essential to understand the large-$n$ behavior of such networks. Most work on wide neural networks has focused on the infinite-width limit $n \to +\infty$, where it has been shown that, at initialization, the networks correspond to Gaussian processes. In this work we study their behavior for large but finite $n$. Our main contributions are the following. (1) We compute the corrections to Gaussianity as an asymptotic series in $n^{-\frac{1}{2}}$; the coefficients of this expansion are determined by the statistics of the parameter initialization and by the activation function. (2) We control the evolution of the outputs of finite-width networks during training by bounding their deviations from the limiting infinite-width case, in which the network evolves through a linear flow. This improves previous estimates and yields sharper decay rates, in terms of $n$, for the finite-width NTK, valid throughout the entire training procedure. As a corollary, we prove that, with arbitrarily high probability, the training of sufficiently wide neural networks converges to a global minimum of the corresponding quadratic loss function. (3) We estimate, in terms of $n$, how the deviations from Gaussianity evolve during training. In particular, with respect to a suitable metric on the space of measures, we show that along training the resulting measure stays within $n^{-\frac{1}{2}}(\log n)^{1+}$ of the time-dependent Gaussian process corresponding to the infinite-width network, which is given explicitly by precomposing the initial Gaussian process with the linear flow describing training in the infinite-width limit.
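To make the width dependence in (1) concrete, here is a minimal Python/NumPy sketch, ours rather than the paper's construction: it estimates the excess kurtosis of the output of a random single-hidden-layer network $f(x)=n^{-\frac{1}{2}}\sum_{i} a_i\,\mathrm{ReLU}(w_i x)$ at a fixed input, over independent i.i.d. standard Gaussian initializations, as a crude scalar proxy for the deviation from Gaussianity. The architecture, activation, and sample sizes are all assumptions of the illustration.

```python
# Minimal sketch (illustrative only; all choices here are ours, not the
# paper's): excess kurtosis of the output of a random single-hidden-layer
# network f(x) = (1/sqrt(n)) * sum_i a_i * relu(w_i * x) at a fixed input,
# as a crude proxy for the deviation from Gaussianity at initialization.
import numpy as np

rng = np.random.default_rng(0)

def sample_outputs(n, x=1.0, n_samples=400_000, batch=40_000):
    """Sample f(x) over independent initializations of a width-n network."""
    out = []
    for _ in range(n_samples // batch):
        w = rng.standard_normal((batch, n))          # hidden-layer weights
        a = rng.standard_normal((batch, n))          # output-layer weights
        out.append((a * np.maximum(w * x, 0.0)).sum(axis=1) / np.sqrt(n))
    return np.concatenate(out)

def excess_kurtosis(z):
    z = (z - z.mean()) / z.std()
    return (z**4).mean() - 3.0                       # vanishes for a Gaussian

for n in [8, 32, 128, 512]:
    k = excess_kurtosis(sample_outputs(n))
    print(f"width n={n:4d}  excess kurtosis ~ {k:+.4f}")
```

Since this initialization is symmetric, the odd terms of the $n^{-\frac{1}{2}}$ expansion do not contribute to this statistic, and a short moment computation for this specific setup gives excess kurtosis $15/n$; the printed values should reproduce that $1/n$ decay up to Monte Carlo noise.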
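Contribution (2) can likewise be probed numerically. The sketch below, again our illustration with ad hoc hyperparameters rather than anything taken from the paper, trains a one-hidden-layer network $f(x)=n^{-\frac{1}{2}}\,a\cdot\tanh(wx)$ by full-batch gradient descent on a toy quadratic loss and measures the relative drift of the empirical NTK Gram matrix between initialization and the end of training, across widths.

```python
# Minimal sketch (ours, not the paper's): relative drift of the empirical
# NTK Gram matrix during full-batch gradient descent on a toy 1-d
# regression task, for f(x) = (1/sqrt(n)) * a . tanh(w * x).
import numpy as np

rng = np.random.default_rng(1)
xs = np.linspace(-1.0, 1.0, 8)                       # toy inputs
ys = np.sin(np.pi * xs)                              # toy targets

def ntk_gram(a, w):
    """Empirical NTK Gram matrix over the training inputs."""
    n = a.size
    h = np.tanh(np.outer(w, xs))                     # (n, m) activations
    dh = (1.0 - h**2) * xs                           # d/dw tanh(w x)
    jac = np.vstack([h, a[:, None] * dh]) / np.sqrt(n)  # (2n, m) Jacobian^T
    return jac.T @ jac                               # (m, m) Gram matrix

def ntk_drift(n, lr=0.1, steps=2000):
    a, w = rng.standard_normal(n), rng.standard_normal(n)
    K0 = ntk_gram(a, w)
    for _ in range(steps):                           # gradient descent on
        h = np.tanh(np.outer(w, xs))                 # L = 0.5 * ||f - y||^2
        r = h.T @ a / np.sqrt(n) - ys                # residuals f(x) - y
        ga = h @ r / np.sqrt(n)                      # dL/da
        gw = ((1.0 - h**2) * xs) @ r * a / np.sqrt(n)  # dL/dw
        a, w = a - lr * ga, w - lr * gw
    K1 = ntk_gram(a, w)
    return np.linalg.norm(K1 - K0) / np.linalg.norm(K0)

for n in [32, 128, 512, 2048]:
    print(f"width n={n:5d}  relative NTK drift ~ {ntk_drift(n):.4f}")
```

The drift should shrink as $n$ grows (for two-layer networks like this one, heuristically on the order of $n^{-\frac{1}{2}}$), which is the finite-width shadow of the infinite-width linear flow; the paper's results make such decay rates precise and valid uniformly over the whole training run.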