Adam is a popular and widely used adaptive gradient method in deep learning, and it has also received considerable attention in theoretical research. However, most existing theoretical work analyzes its full-batch version, which differs fundamentally from the stochastic variant used in practice. Unlike SGD, stochastic Adam does not converge to its full-batch counterpart even with infinitesimal learning rates. We present the first theoretical characterization of how batch size affects Adam's generalization, analyzing two-layer over-parameterized CNNs on image data. Our results reveal that while full-batch Adam and AdamW with properly chosen weight decay $\lambda$ converge to solutions with poor test error, their mini-batch variants can achieve near-zero test error. We further prove that Adam admits a strictly smaller bound on its effective weight decay than AdamW, theoretically explaining why Adam requires more careful tuning of $\lambda$. Extensive experiments validate our findings, demonstrating the critical role of batch size and weight decay in Adam's generalization performance.
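For reference, here is a minimal sketch of the standard update rules underlying the Adam/AdamW comparison; the symbols $g_t$, $m_t$, $v_t$, $\hat m_t$, $\hat v_t$, $\beta_1$, $\beta_2$, $\eta$, $\epsilon$ follow the conventional definitions and are not introduced in the abstract itself. Adam with weight decay folds $\lambda\theta_t$ into the gradient before adaptive rescaling, whereas AdamW applies the decay directly to the parameters:
$$
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \qquad \hat m_t = \frac{m_t}{1-\beta_1^t}, \quad \hat v_t = \frac{v_t}{1-\beta_2^t},
$$
$$
\text{Adam ($\ell_2$ decay):}\quad g_t = \nabla L(\theta_t) + \lambda\theta_t, \qquad \theta_{t+1} = \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon},
$$
$$
\text{AdamW (decoupled decay):}\quad g_t = \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\left(\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon} + \lambda\theta_t\right).
$$
Because Adam's decay term is rescaled elementwise by the adaptive preconditioner $1/(\sqrt{\hat v_t}+\epsilon)$, the decay it effectively applies depends on the gradient statistics, while AdamW's decay is a fixed multiple of the parameters; this structural difference is what the effective weight-decay bound discussed above refers to.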