Large Language Model (LLM) inference has emerged as a fundamental paradigm. In real-world scenarios, variation in output length causes severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as prefill-decode (PD) disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in service-level objective (SLO) violations and out-of-memory (OOM) failures under evolving decode workloads. In this paper, we propose ARES, an adaptive decode rescheduling system that uses length prediction to anticipate future workloads. Our core contributions are: (1) a lightweight, continuous, LLM-native prediction method that leverages LLM hidden states to model remaining generation length with high accuracy (reducing mean absolute error by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) a decode-phase rescheduling solution built on a dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 time per output token (TPOT) by 74.77% and achieving up to 2.24x higher goodput.
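To make the first contribution concrete, the following is a minimal sketch (in PyTorch) of the general idea behind an LLM-native length predictor: a small regression head reads the base model's last-layer hidden state at each decode step and estimates the remaining generation length. All module names, sizes, and the MLP structure here are illustrative assumptions, not ARES's actual architecture, which is described in the paper.

```python
import torch
import torch.nn as nn

class RemainingLengthHead(nn.Module):
    """Hypothetical lightweight head: hidden state -> remaining-token estimate."""
    def __init__(self, hidden_size: int, proj_size: int = 128):
        super().__init__()
        # A tiny MLP keeps predictor parameters negligible relative
        # to the base model, in the spirit of the "low overhead" claim.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, proj_size),
            nn.ReLU(),
            nn.Linear(proj_size, 1),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: [batch, hidden_size], e.g. the last-layer state
        # of the token produced at the current decode step.
        return self.mlp(hidden_state).squeeze(-1)  # [batch] length estimates

# Usage: refresh each request's workload estimate at every decode step,
# so the scheduler can act on continuously updated predictions.
head = RemainingLengthHead(hidden_size=4096)
h = torch.randn(8, 4096)            # stand-in for real hidden states
predicted_remaining = head(h)       # per-request remaining-length estimates
```

Because the head runs on hidden states the model already computes, each prediction costs only one small MLP forward pass per decode step.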