Increasing the batch size during training -- a ``batch ramp'' -- is a promising strategy to accelerate large language model pretraining. While for SGD, doubling the batch size can be equivalent to halving the learning rate, the optimal strategy for adaptive optimizers like Adam is less clear. As a result, batch-ramp schedules, when used at all, are typically tuned heuristically. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size, preserving loss dynamics while reducing serial steps. Theoretically, we provide, to our knowledge, the first finite-sample proof of equivalence between learning-rate decay and batch-size ramp-up for SGD on noisy linear regression, and we extend this equivalence to normalized SGD, a tractable proxy for Adam, under a variance-dominated regime observed in practice. Empirically, on 150M/300M/600M-parameter models trained at Chinchilla scale using a constant (critical) batch size, Seesaw matches cosine decay at equal FLOPs while reducing wall-clock time by $\approx 36\%$, approaching the theoretical limit implied by our analysis.
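To make the scheduling rule concrete, the following is a minimal illustrative sketch of the Seesaw substitution described above: at each event where a stepwise baseline would halve the learning rate, Seesaw instead scales the learning rate by $1/\sqrt{2}$ and doubles the batch size. The function names, base learning rate, base batch size, and the stepwise-halving baseline are assumptions for illustration, not the paper's implementation.

```python
import math


def seesaw_step(lr: float, batch_size: int) -> tuple[float, int]:
    """One Seesaw event: where a standard step schedule would halve the
    learning rate, instead scale it by 1/sqrt(2) and double the batch size."""
    return lr / math.sqrt(2), batch_size * 2


def build_schedules(base_lr: float, base_batch: int, num_events: int):
    """Compare a stepwise-halving baseline against the Seesaw rule over the
    same sequence of decay events (illustrative only)."""
    baseline = [(base_lr, base_batch)]
    seesaw = [(base_lr, base_batch)]
    for _ in range(num_events):
        lr_b, bs_b = baseline[-1]
        baseline.append((lr_b / 2, bs_b))        # baseline: halve LR, batch size fixed
        seesaw.append(seesaw_step(*seesaw[-1]))  # Seesaw: LR * 1/sqrt(2), batch * 2
    return baseline, seesaw


if __name__ == "__main__":
    # Hypothetical base values chosen only to print a comparison table.
    base, see = build_schedules(base_lr=3e-4, base_batch=512, num_events=4)
    for (lr_b, bs_b), (lr_s, bs_s) in zip(base, see):
        print(f"baseline lr={lr_b:.2e} B={bs_b:4d} | seesaw lr={lr_s:.2e} B={bs_s:4d}")
```

Because each Seesaw event doubles the batch size, later decay stages process the same number of tokens in roughly half as many serial steps, which is the source of the wall-clock savings reported above.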