Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound regimes but degrades performance in high-load, compute-bound regimes due to verification overhead. Current SD implementations use a fixed speculative length and therefore cannot adapt to dynamic request rates, creating a significant performance bottleneck in real-world serving scenarios. To overcome this, we propose Nightjar, a novel learning-based algorithm for adaptive speculative inference that tracks request load, dynamically selects the speculative length best suited to each batch size, and disables speculative decoding entirely when it provides no benefit. Experiments show that Nightjar achieves up to 14.8% higher throughput and 20.2% lower latency than standard speculative decoding, demonstrating robust efficiency for real-time serving.
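The abstract describes the mechanism only at a high level, so the sketch below is a hypothetical illustration rather than Nightjar's actual algorithm: an epsilon-greedy bandit that learns a speculative length per batch-size bucket, rewarded by observed throughput, with k = 0 standing for speculation disabled. All names here (`AdaptiveSpecLength`, `SPEC_LENGTHS`, `bucket`) are assumptions introduced for illustration.

```python
import random
from collections import defaultdict

# Hypothetical sketch, not the paper's implementation: an epsilon-greedy
# bandit that picks a speculative length per batch-size bucket.
# k = 0 means speculative decoding is disabled for that step.

SPEC_LENGTHS = [0, 2, 4, 8]  # candidate speculative lengths (assumed values)
EPSILON = 0.1                # exploration rate

def bucket(batch_size: int) -> int:
    """Coarse power-of-two bucket so nearby batch sizes share statistics."""
    return batch_size.bit_length()  # 1 -> 1, 2-3 -> 2, 4-7 -> 3, ...

class AdaptiveSpecLength:
    """Epsilon-greedy choice of speculative length, rewarded by the
    throughput observed after each decode step (tokens per second)."""

    def __init__(self):
        # Per (bucket, k): running reward sum and observation count.
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def choose(self, batch_size: int) -> int:
        b = bucket(batch_size)
        if random.random() < EPSILON:
            return random.choice(SPEC_LENGTHS)  # explore

        def mean(k: int) -> float:
            n = self.counts[(b, k)]
            # Unseen arms get +inf so each candidate is tried at least once.
            return self.totals[(b, k)] / n if n else float("inf")

        # Exploit: the k with the highest mean throughput in this bucket.
        return max(SPEC_LENGTHS, key=mean)

    def update(self, batch_size: int, k: int, tokens: int, seconds: float):
        b = bucket(batch_size)
        self.totals[(b, k)] += tokens / seconds
        self.counts[(b, k)] += 1

# Usage inside a serving loop (measurements are placeholders):
policy = AdaptiveSpecLength()
k = policy.choose(batch_size=12)  # speculative length for this step; 0 skips drafting
# ... run one decode step with k draft tokens, measure output and wall time ...
policy.update(batch_size=12, k=k, tokens=37, seconds=0.05)
```

Because the reward is measured throughput, such a policy would naturally converge to k = 0 under high load, where verification overhead dominates, which matches the adaptive disabling behavior the abstract describes.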