The main advantages of diffusion language models over autoregressive (AR) models are parallel generation and bidirectional attention, which together enable a more controllable generation process. In recent years, open-source masked diffusion language models have emerged, most of them built on a variant known as absorbing diffusion. In this paper, however, we demonstrate why masked diffusion faces inherent difficulties in achieving parallel generation and bidirectional attention. We further propose the training and inference strategies that we find most effective for masked diffusion.
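To make the terminology concrete, the sketch below illustrates the absorbing (masked) forward process that such models are typically trained on: each token is independently replaced by a [MASK] token with probability given by the corruption level, and the model learns to recover the masked positions. The `MASK_ID` constant, the uniform per-token masking, and the `-100` ignore index are illustrative assumptions, not details taken from this paper.

```python
import torch

MASK_ID = 0  # hypothetical [MASK] token id; real vocabularies differ

def absorbing_forward(tokens: torch.Tensor, t: float) -> torch.Tensor:
    """Corrupt a token sequence via the absorbing-state forward process:
    each token is independently replaced by [MASK] with probability t.

    tokens: (batch, seq_len) integer token ids
    t:      corruption level in [0, 1]; t = 1 masks everything
    """
    mask = torch.rand_like(tokens, dtype=torch.float) < t
    return torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

# Training pair: the model sees the corrupted sequence and is trained to
# predict only the masked positions (unmasked positions are ignored).
x0 = torch.tensor([[5, 17, 42, 8, 23]])
t = torch.rand(1).item()  # sample a corruption level for this example
xt = absorbing_forward(x0, t)
targets = torch.where(xt == MASK_ID, x0, torch.full_like(x0, -100))  # -100 = ignore index
```

Because every masked position is predicted from the full corrupted sequence at once, training naturally uses bidirectional attention; parallel generation then amounts to unmasking several positions per step at inference time.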