Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs have been released, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared to Qwen2.5-3B, an AR model with a comparable number of activated parameters and similar performance that is served by the highly optimized vLLM inference engine, dInfer still delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
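To make the four-component decomposition concrete, below is a minimal Python sketch of how such a pipeline could be wired together. All class names, method signatures, and the mask-token convention here are illustrative assumptions for exposition; they do not reflect dInfer's actual API.

```python
# Hypothetical sketch of a modular dLLM inference pipeline with the four
# components named in the abstract. Names and signatures are assumptions.
from dataclasses import dataclass
from typing import Protocol


class Model(Protocol):
    """Scores every position of the (partially masked) sequence in one forward pass."""
    def forward(self, tokens: list[int], cache: dict) -> list[list[float]]: ...


class DecodingStrategy(Protocol):
    """Decides which masked positions to commit, enabling parallel decoding."""
    def select(self, logits: list[list[float]], tokens: list[int]) -> dict[int, int]: ...


class KVCacheManager(Protocol):
    """Maintains and refreshes cached attention states across denoising iterations."""
    def update(self, tokens: list[int]) -> dict: ...


@dataclass
class IterationManager:
    """Drives denoising iterations until no masked tokens remain."""
    model: Model
    strategy: DecodingStrategy
    cache_manager: KVCacheManager
    mask_id: int = -1  # placeholder id for a masked position

    def generate(self, tokens: list[int]) -> list[int]:
        while self.mask_id in tokens:
            cache = self.cache_manager.update(tokens)
            logits = self.model.forward(tokens, cache)
            # Commit all positions the strategy accepts in this iteration;
            # committing several at once is what yields the parallel speedup.
            for pos, tok in self.strategy.select(logits, tokens).items():
                tokens[pos] = tok
        return tokens
```

Under this factoring, each axis of optimization (a faster model kernel, a more aggressive parallel decoding strategy, a smarter cache-refresh policy) can be swapped independently of the others, which is the extensibility the framework claims.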