Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs have been released, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared to Qwen2.5-3B, an AR model with a comparable number of activated parameters and similar performance that is served by the highly optimized vLLM inference engine, dInfer still delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
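To make the four-component decomposition concrete, below is a minimal Python sketch of how such a pipeline could be wired together. All class names, method signatures, and the mask-token convention here are illustrative assumptions for exposition; they do not reflect dInfer's actual API.

```python
# Hypothetical sketch of a modular dLLM inference pipeline with the four
# components named in the abstract. Names and signatures are assumptions.
from dataclasses import dataclass
from typing import Protocol


class Model(Protocol):
    """Scores every position of the (partially masked) sequence in one forward pass."""
    def forward(self, tokens: list[int], cache: dict) -> list[list[float]]: ...


class DecodingStrategy(Protocol):
    """Decides which masked positions to commit, enabling parallel decoding."""
    def select(self, logits: list[list[float]], tokens: list[int]) -> dict[int, int]: ...


class KVCacheManager(Protocol):
    """Maintains and refreshes cached attention states across denoising iterations."""
    def update(self, tokens: list[int]) -> dict: ...


@dataclass
class IterationManager:
    """Drives denoising iterations until no masked tokens remain."""
    model: Model
    strategy: DecodingStrategy
    cache_manager: KVCacheManager
    mask_id: int = -1  # placeholder id for a masked position

    def generate(self, tokens: list[int]) -> list[int]:
        while self.mask_id in tokens:
            cache = self.cache_manager.update(tokens)
            logits = self.model.forward(tokens, cache)
            # Commit all positions the strategy accepts in this iteration;
            # committing several at once is what yields the parallel speedup.
            for pos, tok in self.strategy.select(logits, tokens).items():
                tokens[pos] = tok
        return tokens
```

Under this factoring, each axis of optimization (a faster model kernel, a more aggressive parallel decoding strategy, a smarter cache-refresh policy) can be swapped independently of the others, which is the extensibility the framework claims.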