Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.