Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
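The core idea, pairwise ranking with a margin ranking loss to order prompts by predicted response length, can be sketched minimally as below. This is an illustrative toy, not the paper's implementation: the word-count scorer and the variable names are hypothetical stand-ins for PARS's learned predictor.

```python
# Sketch of the pairwise margin ranking loss used to learn a
# response-length ordering over prompts. The scorer here is a crude
# word-count proxy, NOT the trained predictor described in the paper.

def margin_ranking_loss(s1, s2, y, margin=1.0):
    """Hinge-style margin ranking loss.
    y = +1 if prompt 1 should rank above prompt 2 (longer response),
    y = -1 otherwise. Loss is zero once the scores are separated by
    at least `margin` in the correct direction."""
    return max(0.0, -y * (s1 - s2) + margin)

def score(prompt):
    # Toy stand-in for a learned length predictor.
    return float(len(prompt.split()))

short = "What is 2+2?"
long_ = "Explain, step by step, how transformers compute attention."

# Correct ordering (longer-response prompt scores higher): zero loss.
loss_ordered = margin_ranking_loss(score(long_), score(short), y=+1)
# Misordered pair: positive loss drives the predictor to reorder.
loss_swapped = margin_ranking_loss(score(short), score(long_), y=+1)

# At serving time, sorting pending prompts by predicted score
# ascending approximates shortest-job-first.
queue = sorted([long_, short], key=score)
```

Training on pairs rather than exact lengths is the key design choice: the scheduler only needs the relative order of jobs, which is an easier target than regressing token counts.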