Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We establish new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR training. Finally, we explain why diffusion models excel in this regime: their randomized masking objective implicitly trains over a rich distribution of token orderings, acting as a form of implicit data augmentation that AR's fixed left-to-right factorization lacks. Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
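To make the contrast between the two objectives concrete, the sketch below compares an AR next-token loss with a masked-diffusion loss in which a fresh masking ratio is drawn per example, so each update effectively sees a different partial ordering of the same sequence. This is a minimal illustration only, not the implementation released at https://diffusion-scaling.github.io; `ToyLM`, `VOCAB`, `MASK_ID`, and the sequence sizes are hypothetical placeholders, and the ELBO weighting used by actual masked diffusion objectives is omitted for brevity.

```python
# Minimal sketch (assumptions, not the paper's code) contrasting the AR
# next-token objective with a randomized masked-diffusion objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, SEQ_LEN, DIM = 100, 99, 16, 32  # illustrative sizes; id 99 reserved for [MASK]

class ToyLM(nn.Module):
    """Tiny stand-in for a transformer: embeds tokens and predicts per-position logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):          # (batch, seq) -> (batch, seq, vocab)
        return self.head(self.embed(tokens))

def ar_loss(model, tokens):
    """Fixed left-to-right factorization: predict token i+1 from positions up to i."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

def masked_diffusion_loss(model, tokens):
    """Draw a masking ratio t ~ U(0, 1) per example, hide each token independently
    with probability t, and train to recover only the hidden positions. Every draw
    amounts to a different ordering over the sequence (the implicit augmentation);
    the 1/t ELBO weighting of real masked diffusion losses is dropped for brevity."""
    t = torch.rand(tokens.size(0), 1)                          # per-example masking ratio
    mask = torch.rand_like(tokens, dtype=torch.float) < t      # which tokens to hide
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(corrupted)
    per_tok = F.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1), reduction="none")
    return (per_tok * mask.reshape(-1).float()).sum() / mask.float().sum().clamp(min=1)

if __name__ == "__main__":
    model = ToyLM()
    batch = torch.randint(0, VOCAB - 1, (4, SEQ_LEN))          # keep MASK_ID out of the data
    print("AR loss:       ", ar_loss(model, batch).item())
    print("Diffusion loss:", masked_diffusion_loss(model, batch).item())
```

Under repeated epochs over the same corpus, the AR loss always presents identical (input, target) pairs, whereas each pass through `masked_diffusion_loss` re-samples which tokens are hidden, which is one way to read the "rich distribution of token orderings" claim above.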