Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters, wasting resources and leaving hardware under-utilized. We introduce FlexLLM, the first system to co-serve LLM inference and parameter-efficient finetuning (PEFT) on shared GPUs by fusing their computation at the token level. FlexLLM's static compilation optimizations (dependent parallelization and graph pruning) significantly shrink activation memory, yielding end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism, paired with a hybrid token scheduler, dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency service-level objectives (SLOs) while maximizing GPU utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO compliance at up to 20 req/s, improves finetuning throughput by $1.9-4.8\times$ under heavy inference workloads and $2.5-6.8\times$ under light ones, and preserves more than 76% of peak finetuning progress even at peak demand. FlexLLM is publicly available at https://flexllm.github.io.
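To make the hybrid token scheduling idea concrete, the sketch below shows one plausible way a per-iteration token budget could be filled with inference tokens first (to protect latency SLOs) and backfilled with finetuning tokens to keep the GPU busy. This is a minimal illustration of the co-serving concept, not FlexLLM's actual implementation; all names here (`TOKEN_BUDGET`, `Request`, `schedule_iteration`) are hypothetical.

```python
# Minimal sketch of a hybrid token scheduler in the spirit of FlexLLM's
# co-serving loop. All names are hypothetical illustrations, not FlexLLM's API.
from dataclasses import dataclass
from typing import List, Tuple

TOKEN_BUDGET = 2048  # assumed max tokens fused into one co-serving iteration


@dataclass
class Request:
    rid: int
    pending_tokens: int  # tokens this request still needs processed


def schedule_iteration(inference_q: List[Request],
                       finetune_q: List[Request]) -> List[Tuple[int, str, int]]:
    """Fill the per-iteration token budget with inference tokens first
    (to honor latency SLOs), then backfill remaining slots with
    finetuning tokens so the GPU stays fully utilized."""
    batch: List[Tuple[int, str, int]] = []
    used = 0
    for queue, kind in ((inference_q, "infer"), (finetune_q, "train")):
        for req in queue:
            if used >= TOKEN_BUDGET:
                break
            take = min(req.pending_tokens, TOKEN_BUDGET - used)
            if take > 0:
                batch.append((req.rid, kind, take))
                used += take
                req.pending_tokens -= take
    return batch


if __name__ == "__main__":
    infer = [Request(0, 1), Request(1, 1)]  # decode steps: 1 token each
    train = [Request(100, 4096)]            # a long finetuning sequence
    print(schedule_iteration(infer, train))
    # Inference tokens are scheduled first; finetuning tokens fill the
    # rest of the budget, e.g. [(0, 'infer', 1), (1, 'infer', 1), (100, 'train', 2046)]
```

Under this scheme, inference tokens always preempt finetuning tokens within an iteration, which matches the abstract's claim that SLO compliance is maintained while finetuning progress degrades gracefully under load.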