We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data are generated as $f_*(\boldsymbol{x}) \propto \sum_{j=1}^{r}\lambda_j\, \sigma\left(\langle \boldsymbol{\theta}_j, \boldsymbol{x}\rangle\right)$, $\boldsymbol{x} \sim N(0,\boldsymbol{I}_d)$, where $\sigma$ is the second Hermite polynomial and $\lbrace\boldsymbol{\theta}_j \rbrace_{j=1}^{r} \subset \mathbb{R}^d$ are orthonormal signal directions. We consider the extensive-width regime $r \asymp d^{\beta}$ for $\beta\in [0, 1)$, and assume a power-law decay of the (non-negative) second-layer coefficients, $\lambda_j\asymp j^{-\alpha}$ with $\alpha\geq 0$. We give a sharp analysis of the SGD dynamics in the feature learning regime, covering both the population limit and its finite-sample (online) discretization, and derive scaling laws for the prediction risk that make explicit its power-law dependence on the optimization time, the sample size, and the model width. Our analysis combines a precise characterization of the associated matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.
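Concretely, under the probabilists' convention $\sigma(z)=\mathrm{He}_2(z)=z^2-1$ (the normalization is not fixed above, since the target is defined only up to proportionality), the target is a centered quadratic form in $\boldsymbol{x}$; the matrix $\boldsymbol{M}$ below is shorthand introduced only for this display:
\[
  f_*(\boldsymbol{x}) \;\propto\; \sum_{j=1}^{r} \lambda_j \left( \langle \boldsymbol{\theta}_j, \boldsymbol{x} \rangle^{2} - 1 \right)
  \;=\; \boldsymbol{x}^{\top} \boldsymbol{M}\, \boldsymbol{x} - \operatorname{tr}(\boldsymbol{M}),
  \qquad \boldsymbol{M} := \sum_{j=1}^{r} \lambda_j\, \boldsymbol{\theta}_j \boldsymbol{\theta}_j^{\top}.
\]
This quadratic-form structure is what underlies the matrix Riccati differential equation mentioned above.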