This paper studies the role of activation functions in learning modular addition with two-layer neural networks. We first establish a sharp expressivity gap: sine MLPs admit width-$2$ exact realizations for any fixed sequence length $m$ and, with a bias term, a single width-$2$ network realizes the task exactly for all lengths simultaneously. In contrast, ReLU networks require width scaling linearly in $m$ to interpolate, and a single ReLU network cannot simultaneously fit two lengths with different residues modulo $p$. We then prove a novel Natarajan-dimension generalization bound for sine networks, yielding near-optimal sample complexity $\widetilde{\mathcal{O}}(p)$ for ERM over constant-width sine networks. We further derive width-independent, margin-based generalization bounds for sine networks in the overparametrized regime and validate them empirically. Across regimes, sine networks consistently generalize better than ReLU networks and exhibit strong length extrapolation.
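To make the width-$2$ claim concrete, the NumPy sketch below illustrates one phase-encoding construction of this kind: two hidden sine units encode the scaled input sum (the second unit carries a bias of $\pi/2$, turning its sine into a cosine), and the $p$ output logits read off the residue as $\cos\bigl(2\pi(\sum_i x_i - c)/p\bigr)$, which is uniquely maximized at $c = \sum_i x_i \bmod p$. This is an illustrative variant, not necessarily the paper's exact realization, and the modulus $p=7$ and length $m=5$ are arbitrary choices for the demonstration.

```python
# Minimal sketch (assumed construction, for illustration only): a width-2
# two-layer sine network that computes modular addition exactly.
import numpy as np

p = 7          # modulus (arbitrary demo value)
m = 5          # sequence length (arbitrary demo value)

# First layer: both hidden units see theta = (2*pi/p) * sum_i x_i;
# the second unit has bias pi/2, so sin(theta + pi/2) = cos(theta).
W = np.full((m, 2), 2 * np.pi / p)
b = np.array([0.0, np.pi / 2])

# Second layer: class-c readout (sin(2*pi*c/p), cos(2*pi*c/p)), so that
# logit_c = cos(2*pi*(sum(x) - c)/p), maximized exactly at c = sum(x) mod p.
c = np.arange(p)
A = np.stack([np.sin(2 * np.pi * c / p), np.cos(2 * np.pi * c / p)])

def predict(x):
    h = np.sin(x @ W + b)      # hidden layer of width 2
    return np.argmax(h @ A)    # argmax over the p class logits

# Sanity check: the network matches sum(x) mod p on random inputs.
rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.integers(0, p, size=m)
    assert predict(x) == x.sum() % p
```

Because the input weights do not depend on $m$, the same two hidden units (with the $\pi/2$ bias) work for any sequence length, matching the "uniform over all lengths" flavor of the stated result.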