Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters, wasting resources and leaving hardware under-utilized. We introduce FlexLLM, the first system to co-serve LLM inference and parameter-efficient finetuning (PEFT) on shared GPUs by fusing their computation at the token level. FlexLLM's static compilation optimizations (dependent parallelization and graph pruning) significantly shrink activation memory, yielding end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism, paired with a hybrid token scheduler, dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency service-level objectives (SLOs) while maximizing GPU utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO compliance at up to 20 req/s, improves finetuning throughput by $1.9-4.8\times$ under heavy inference workloads and $2.5-6.8\times$ under light ones, and preserves more than 76% of peak finetuning progress even at peak demand. FlexLLM is publicly available at https://flexllm.github.io.
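To make the hybrid token scheduling idea concrete, the sketch below shows one plausible way a per-iteration token budget could be filled with inference tokens first (to protect latency SLOs) and backfilled with finetuning tokens to keep the GPU busy. This is a minimal illustration of the co-serving concept, not FlexLLM's actual implementation; all names here (`TOKEN_BUDGET`, `Request`, `schedule_iteration`) are hypothetical.

```python
# Minimal sketch of a hybrid token scheduler in the spirit of FlexLLM's
# co-serving loop. All names are hypothetical illustrations, not FlexLLM's API.
from dataclasses import dataclass
from typing import List, Tuple

TOKEN_BUDGET = 2048  # assumed max tokens fused into one co-serving iteration


@dataclass
class Request:
    rid: int
    pending_tokens: int  # tokens this request still needs processed


def schedule_iteration(inference_q: List[Request],
                       finetune_q: List[Request]) -> List[Tuple[int, str, int]]:
    """Fill the per-iteration token budget with inference tokens first
    (to honor latency SLOs), then backfill remaining slots with
    finetuning tokens so the GPU stays fully utilized."""
    batch: List[Tuple[int, str, int]] = []
    used = 0
    for queue, kind in ((inference_q, "infer"), (finetune_q, "train")):
        for req in queue:
            if used >= TOKEN_BUDGET:
                break
            take = min(req.pending_tokens, TOKEN_BUDGET - used)
            if take > 0:
                batch.append((req.rid, kind, take))
                used += take
                req.pending_tokens -= take
    return batch


if __name__ == "__main__":
    infer = [Request(0, 1), Request(1, 1)]  # decode steps: 1 token each
    train = [Request(100, 4096)]            # a long finetuning sequence
    print(schedule_iteration(infer, train))
    # Inference tokens are scheduled first; finetuning tokens fill the
    # rest of the budget, e.g. [(0, 'infer', 1), (1, 'infer', 1), (100, 'train', 2046)]
```

Under this scheme, inference tokens always preempt finetuning tokens within an iteration, which matches the abstract's claim that SLO compliance is maintained while finetuning progress degrades gracefully under load.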