Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), using parallel decoding to overcome the sequential decoding bottleneck. However, existing research focuses primarily on kernel-level optimizations and lacks a holistic serving framework that addresses the unique memory dynamics of diffusion processes in production. We identify a critical "memory footprint crisis" specific to dLLMs, driven by monolithic logit tensors and the severe resource oscillation between compute-bound "Refresh" phases and bandwidth-bound "Reuse" phases. To bridge this gap, we present dLLM-Serve, an efficient dLLM serving system that co-optimizes memory footprint, computational scheduling, and generation quality. dLLM-Serve introduces Logit-Aware Activation Budgeting to decompose transient tensor peaks, a Phase-Multiplexed Scheduler to interleave heterogeneous request phases, and Head-Centric Sparse Attention to decouple logical sparsity from physical storage. We evaluate dLLM-Serve on diverse workloads (LiveBench, Burst, OSC) and GPUs (RTX 4090, L40S). Relative to the state-of-the-art baseline, dLLM-Serve improves throughput by 1.61$\times$--1.81$\times$ on the consumer-grade RTX 4090 and 1.60$\times$--1.74$\times$ on the server-grade NVIDIA L40S, while reducing tail latency by nearly 4$\times$ under heavy contention. dLLM-Serve establishes the first blueprint for scalable dLLM inference, converting theoretical algorithmic sparsity into tangible wall-clock acceleration across heterogeneous hardware.