There is a recent and growing literature on large-width asymptotic properties of Gaussian neural networks (NNs), namely NNs whose weights are initialized according to Gaussian distributions. Two popular problems are: i) the study of the large-width distributions of NNs, which characterizes the infinitely wide limit of a rescaled NN in terms of a Gaussian stochastic process; ii) the study of the large-width training dynamics of NNs, which characterizes the infinitely wide dynamics in terms of a deterministic kernel, referred to as the neural tangent kernel (NTK), and shows that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. In this paper, we consider these problems for $\alpha$-Stable NNs, namely NNs whose weights are initialized according to $\alpha$-Stable distributions with $\alpha\in(0,2]$. First, for $\alpha$-Stable NNs with a ReLU activation function, we show that, as the NN's width goes to infinity, a rescaled NN converges weakly to an $\alpha$-Stable stochastic process. In contrast to the Gaussian setting, our result shows that the choice of activation function affects the scaling of the NN: to obtain the infinitely wide $\alpha$-Stable process, the ReLU activation requires an additional logarithmic term in the scaling relative to sub-linear activations. Then, we study the large-width training dynamics of $\alpha$-Stable ReLU-NNs, characterizing the infinitely wide dynamics in terms of a random kernel, referred to as the $\alpha$-Stable NTK, and showing that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. The randomness of the $\alpha$-Stable NTK is a further difference from the Gaussian setting: in the $\alpha$-Stable setting, the randomness of the NN at initialization does not vanish in the large-width regime of the training.
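To fix ideas, here is a minimal sketch of the shallow case; the notation ($f_n$, $w_j$, $v_j$) and the exact form of the normalization below are illustrative assumptions, since the abstract only states that the ReLU activation requires an additional logarithmic term in the scaling. For a one-hidden-layer NN of width $n$ with input $x\in\mathbb{R}^d$ and i.i.d. (e.g. symmetric) $\alpha$-Stable weights $w_j$ and $v_{j,i}$, the rescaled NN would read
\[
f_n(x) \;=\; \frac{1}{(n \log n)^{1/\alpha}} \sum_{j=1}^{n} w_j\, \mathrm{ReLU}\!\big(\langle v_j, x\rangle\big),
\]
whereas a sub-linear activation $\phi$ would call for the normalization $n^{1/\alpha}$ alone, in analogy with the $n^{1/2}$ normalization of the Gaussian setting, where no logarithmic correction appears.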