Diffusion Beats Autoregressive in Data-Constrained Settings

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings where training involves repeated passes over limited data and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. Finally, we explain why diffusion models excel in this regime: their randomized masking objective implicitly trains over a rich distribution of token orderings, acting as an implicit data augmentation that AR's fixed left-to-right factorization lacks. Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
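The data-augmentation argument above can be made concrete with a toy count. The sketch below is an illustration only, not the authors' code: the function names are hypothetical, and drawing the mask ratio uniformly from [0, 1] is an assumption about how a masked diffusion objective might be sampled. It repeats one short sequence for many epochs and counts the distinct (visible context, target position) prediction tasks each objective encounters.

```python
import random


def ar_training_pairs(n_tokens):
    """Left-to-right factorization: position i is always predicted
    from the same context, the prefix 0..i-1."""
    return {(tuple(range(i)), i) for i in range(n_tokens)}


def masked_training_pairs(n_tokens, rng):
    """One masked-diffusion-style draw (assumed sampling scheme):
    pick a mask ratio uniformly, mask each position independently,
    and predict every masked position from the visible ones."""
    ratio = rng.uniform(0.0, 1.0)
    masked = [i for i in range(n_tokens) if rng.random() < ratio]
    masked_set = set(masked)
    visible = tuple(i for i in range(n_tokens) if i not in masked_set)
    return {(visible, i) for i in masked}


tokens = ["the", "cat", "sat", "on", "the", "mat"]
n = len(tokens)
rng = random.Random(0)

ar_seen, md_seen = set(), set()
for _ in range(100):  # 100 repeated passes over the same sequence
    ar_seen |= ar_training_pairs(n)
    md_seen |= masked_training_pairs(n, rng)

print("distinct AR prediction tasks:    ", len(ar_seen))
print("distinct masked prediction tasks:", len(md_seen))
```

Under these assumptions the AR objective sees the same 6 prediction tasks on every pass, while the masked objective keeps encountering new context and target combinations, up to 6 × 2^5 = 192 for a six-token sequence. This only illustrates the counting intuition; it says nothing about the validation-loss or scaling-law results reported in the paper.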

