Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
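The core idea above can be illustrated with a minimal sketch. This is not the paper's implementation: the scores, prompts, and helper names below are hypothetical stand-ins for a learned predictor. It shows (a) the standard pairwise margin ranking loss, which is zero once the longer-response prompt outscores the shorter one by at least the margin, and (b) how such scores yield an approximate shortest-job-first ordering of the queue.

```python
# Hedged sketch (assumptions, not PARS's actual code): a predictor assigns
# each prompt a score meant to track its response length; training uses a
# pairwise margin ranking loss, and serving sorts the queue by score.

def margin_ranking_loss(score_long, score_short, margin=1.0):
    """Pairwise margin ranking loss: zero once the longer-response
    prompt outscores the shorter-response prompt by at least `margin`."""
    return max(0.0, margin - (score_long - score_short))

def sjf_order(tasks, predict):
    """Approximate shortest-job-first: sort queued tasks by predicted score."""
    return sorted(tasks, key=predict)

# Toy usage with made-up scores standing in for a trained predictor.
scores = {"write a long essay": 3.0, "summarize this line": 1.0, "hi": 0.5}
queue = list(scores)
print(sjf_order(queue, scores.get))    # shortest predicted response first
print(margin_ranking_loss(3.0, 1.0))   # 0.0: pair already well separated
print(margin_ranking_loss(1.0, 3.0))   # 3.0: mis-ranked pair is penalized
```

Note the loss only penalizes mis-ordered (or insufficiently separated) pairs, which is what lets training focus on the pairwise orderings that actually change scheduling decisions rather than on exact length regression.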