We present DiffusionBERT, a new generative masked language model based on discrete diffusion models. Diffusion models and many pre-trained language models have a shared training objective, i.e., denoising, making it possible to combine the two powerful models and enjoy the best of both worlds. On the one hand, diffusion models offer a promising training strategy that helps improve the generation quality. On the other hand, pre-trained denoising language models (e.g., BERT) can be used as a good initialization that accelerates convergence. We explore training BERT to learn the reverse process of a discrete diffusion process with an absorbing state and elucidate several designs to improve it. First, we propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step based on the information of each token. Second, we investigate several designs of incorporating the time step into BERT. Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text (e.g., D3PM and Diffusion-LM) and previous generative masked language models in terms of perplexity and BLEU score.
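
To make the absorbing-state forward process concrete, below is a minimal sketch (our own illustration, not the authors' released code) of a single corruption step, assuming a PyTorch setup, a simple linear schedule, and hypothetical helper names (`alpha_bar`, `forward_mask`). The paper's proposed schedule additionally modulates the masking probability per token according to that token's information; the uniform schedule here is only for illustration.

```python
import torch

def alpha_bar(t: int, T: int) -> float:
    """Cumulative keep-probability under a simple linear schedule (1 at t=0, 0 at t=T).
    Placeholder for the paper's token-information-aware schedule."""
    return 1.0 - t / T

def forward_mask(x0: torch.LongTensor, t: int, T: int, mask_id: int,
                 special_mask: torch.BoolTensor) -> torch.LongTensor:
    """Sample x_t ~ q(x_t | x_0) for an absorbing-state diffusion:
    each token is independently replaced by [MASK] with probability 1 - alpha_bar(t)."""
    keep_prob = alpha_bar(t, T)
    noise = torch.rand(x0.shape)
    corrupted = torch.where(noise < keep_prob, x0, torch.full_like(x0, mask_id))
    # Never corrupt special tokens such as [CLS], [SEP], or [PAD].
    return torch.where(special_mask, x0, corrupted)
```

Training then resembles time-dependent masked language modeling: BERT is asked to recover the original tokens at the masked positions of `x_t`, which is how it parameterizes the reverse process in this framing.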