Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
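For intuition on the second ingredient, the sketch below shows one way a block-wise causal attention mask can be built in PyTorch: tokens attend bidirectionally inside their own block and causally to all earlier blocks, which is what makes block-level KV caching possible. The function name, block size, and tensor layout here are illustrative assumptions, not the repository's actual implementation.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Illustrative block-wise causal attention mask (not the authors' exact code).

    Returns a boolean mask of shape (seq_len, seq_len) where True means
    "position i may attend to position j".
    """
    block_ids = torch.arange(seq_len) // block_size   # block index of each position
    # Allowed iff the key's block is not after the query's block:
    # full attention within a block, causal attention across blocks.
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

# Example: 8 tokens in blocks of 4. The first block never attends to the second,
# so its keys/values can be cached once its tokens are finalized.
print(block_causal_mask(8, 4).int())
```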