In this paper, we investigate a two-layer fully connected neural network of the form $f(X)=\frac{1}{\sqrt{d_1}}\boldsymbol{a}^\top \sigma\left(WX\right)$, where $X\in\mathbb{R}^{d_0\times n}$ is a deterministic data matrix, $W\in\mathbb{R}^{d_1\times d_0}$ and $\boldsymbol{a}\in\mathbb{R}^{d_1}$ are random Gaussian weights, and $\sigma$ is a nonlinear activation function. We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$, the empirical conjugate kernel (CK) and the neural tangent kernel (NTK), beyond the linear-width regime ($d_1\asymp n$). We focus on the \textit{ultra-wide regime}, in which the width $d_1$ of the first layer is much larger than the sample size $n$. Under suitable assumptions on $X$ and $\sigma$, a deformed semicircle law emerges as $d_1/n\to\infty$ and $n\to\infty$. We first prove this limiting law for generalized sample covariance matrices with dependent entries. To specialize it to our neural network model, we establish a nonlinear Hanson-Wright inequality suited to neural networks with random weights and Lipschitz activation functions. We also obtain non-asymptotic concentration of the empirical CK and NTK around their limiting kernels in spectral norm, together with lower bounds on their smallest eigenvalues. As an application, we show that random feature regression induced by the empirical kernel achieves the same asymptotic performance as the corresponding limiting kernel regression in the ultra-wide regime, which allows us to compute the asymptotic training and test errors of random feature regression via the corresponding kernel regression.
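For concreteness, the empirical CK and NTK associated with $f(X)$ are commonly formalized as below; this is the standard normalization for this two-layer architecture, though the paper's exact conventions may differ slightly.
\[
\mathrm{CK} \;=\; \frac{1}{d_1}\,\sigma(WX)^\top \sigma(WX) \;\in\; \mathbb{R}^{n\times n},
\qquad
\mathrm{NTK} \;=\; \mathrm{CK} \;+\; \frac{1}{d_1}\,\bigl(X^\top X\bigr)\odot\Bigl(\sigma'(WX)^\top \operatorname{diag}(\boldsymbol{a})^{2}\, \sigma'(WX)\Bigr),
\]
where $\odot$ denotes the Hadamard product and the two NTK terms come from differentiating $f$ with respect to $\boldsymbol{a}$ and $W$, respectively. In this formulation, the deformed semicircle law describes (up to the paper's precise centering) the limiting spectrum of the rescaled fluctuation matrix $\sqrt{d_1/n}\,\bigl(\mathrm{CK}-\mathbb{E}[\mathrm{CK}]\bigr)$ as $d_1/n\to\infty$ and $n\to\infty$.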