Provable Benefits of Sinusoidal Activation for Modular Addition

This paper studies the role of activation functions in learning modular addition with two-layer neural networks. We first establish a sharp expressivity gap: sine MLPs admit width-$2$ exact realizations for any fixed length $m$ and, with bias, width-$2$ exact realizations that hold uniformly over all lengths. In contrast, ReLU networks need width that scales linearly with $m$ to interpolate, and they cannot simultaneously fit two lengths with different residues modulo $p$. We then provide a novel Natarajan-dimension generalization bound for sine networks, yielding a nearly optimal sample complexity of $\widetilde{\mathcal{O}}(p)$ for ERM over constant-width sine networks. We also derive a width-independent, margin-based generalization bound for sine networks in the overparametrized regime and validate it empirically. In our experiments, sine networks generalize consistently better than ReLU networks across regimes and exhibit strong length extrapolation.
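
To make the width-$2$ claim concrete, the following is a minimal sketch of one way a two-unit sine network with bias can compute modular addition for inputs of any length. It is an illustration under stated assumptions, not necessarily the construction used in the paper: the input is assumed to be encoded as a vector of token counts, and the function name `sine_mlp_predict` and the specific weights are hypothetical. With $\theta = 2\pi s / p$ for the token sum $s$, the two hidden units compute $\sin\theta$ and $\sin(\theta + \pi/2) = \cos\theta$, and the output logit for class $k$ is $\cos(\theta - 2\pi k / p)$, which is maximized exactly at $k = s \bmod p$.

```python
import numpy as np

def sine_mlp_predict(tokens, p):
    """Predict (sum of tokens) mod p with a width-2 sine network.

    Assumption: the input sequence is encoded as a count vector over Z_p.
    The weights below are hand-set for illustration, not learned.
    """
    tokens = np.asarray(tokens)
    counts = np.bincount(tokens, minlength=p)   # bag-of-tokens encoding, shape (p,)

    freqs = 2 * np.pi * np.arange(p) / p        # 2*pi*v/p for each token value v
    W = np.stack([freqs, freqs])                # first-layer weights, shape (2, p)
    b = np.array([0.0, np.pi / 2])              # bias turns the second sine into a cosine

    theta_plus_b = W @ counts + b               # [theta, theta + pi/2]
    h = np.sin(theta_plus_b)                    # hidden layer: [sin(theta), cos(theta)]

    # Output layer: logit_k = sin(2*pi*k/p)*sin(theta) + cos(2*pi*k/p)*cos(theta)
    #                       = cos(theta - 2*pi*k/p)
    V = np.stack([np.sin(freqs), np.cos(freqs)], axis=1)  # shape (p, 2)
    logits = V @ h
    return int(np.argmax(logits))
```

Because the same weights are used regardless of how many tokens are summed, this sketch works for every sequence length at once, which is the sense in which a biased width-$2$ sine network can be uniform over lengths. For a single fixed length $m$, the phase shift $\pi/2$ could instead be folded into the first-layer weights by adding $\pi/(2m)$ to each entry of the second row, since the counts always sum to $m$.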
