We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data is generated as $y \propto \sum_{j=1}^{r}\lambda_j \sigma\left(\langle \boldsymbol{\theta}_j, \boldsymbol{x}\rangle\right)$, $\boldsymbol{x} \sim N(0,\boldsymbol{I}_d)$, $\sigma$ is the second Hermite polynomial, and $\{\boldsymbol{\theta}_j\}_{j=1}^{r} \subset \mathbb{R}^d$ are orthonormal signal directions. We consider the extensive-width regime $r \asymp d^{\beta}$ for $\beta \in [0, 1)$, and assume a power-law decay of the (non-negative) second-layer coefficients, $\lambda_j \asymp j^{-\alpha}$ for $\alpha \geq 0$. We present a sharp analysis of the SGD dynamics in the feature learning regime, for both the population limit and the finite-sample (online) discretization, and derive scaling laws for the prediction risk that highlight the power-law dependencies on the optimization time, sample size, and model width. Our analysis combines a precise characterization of the associated matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.
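As a concrete illustration of the data-generating process described above, the following minimal sketch (not the authors' code) samples Gaussian inputs and targets of the stated form; the specific values of $d$, $\beta$, $\alpha$, and the sample size $n$ are placeholder assumptions for the example.

```python
# Minimal sketch of the data model: y ∝ sum_j lambda_j He_2(<theta_j, x>),
# x ~ N(0, I_d), with orthonormal theta_j and power-law coefficients.
import numpy as np

rng = np.random.default_rng(0)

d = 512                    # ambient dimension (placeholder)
beta, alpha = 0.5, 1.0     # width exponent r ≍ d^beta, decay lambda_j ≍ j^(-alpha)
r = int(round(d ** beta))  # extensive width
n = 4096                   # number of online samples (placeholder)

# Orthonormal signal directions theta_1, ..., theta_r (columns of Q).
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))

# Power-law second-layer coefficients, normalized to unit Euclidean norm.
lam = np.arange(1, r + 1, dtype=float) ** (-alpha)
lam /= np.linalg.norm(lam)

def he2(z):
    """Second (probabilist's) Hermite polynomial He_2(z) = z^2 - 1."""
    return z ** 2 - 1.0

# Gaussian inputs and targets.
X = rng.standard_normal((n, d))            # rows are x ~ N(0, I_d)
y = (he2(X @ Q) * lam).sum(axis=1)         # y ∝ sum_j lambda_j He_2(<theta_j, x>)
```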