Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size raises acceptance rates but adds latency overhead, exacerbating the latency-acceptance tradeoff. Prior methods (Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from early-exit signals in parallel with the target model's suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU) to exploit cross-device parallelism. The draft speculates forward continuations for the target to verify, while the target simultaneously speculates correction paths for the draft, converting speculation into two complementary execution pipelines. To further cut draft latency without weakening acceptance semantics, we add speculative streaming so that the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution plus multi-token speculative streaming pushes speculative decoding toward its ideal regime of high acceptance with low overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline, EAGLE3.
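To make the draft-then-verify structure concrete, the sketch below shows vanilla speculative decoding with greedy acceptance: the draft proposes k tokens autoregressively, the target verifies them and contributes one token past the accepted prefix. It is an illustrative toy, not the Mirror-SD pipeline; the helper names (`draft_next`, `target_next`, `speculative_step`) and the toy token functions are assumptions for demonstration. Mirror-SD differs by overlapping the two phases across heterogeneous devices and letting the draft stream several tokens per step.

```python
# Minimal sketch of standard draft-then-verify speculative decoding with
# greedy acceptance. `draft_next` and `target_next` are toy stand-ins for
# real draft/target models, chosen so the two occasionally disagree.

def draft_next(prefix):
    # Toy draft model: next token is (last token + 1) mod 50.
    return (prefix[-1] + 1) % 50

def target_next(prefix):
    # Toy target model: agrees with the draft except every 7th position.
    nxt = (prefix[-1] + 1) % 50
    return nxt if len(prefix) % 7 else (nxt + 2) % 50

def speculative_step(prefix, k=4):
    """Draft k tokens autoregressively, then verify them with the target.

    Returns the extended prefix; the target adds one token after the longest
    accepted draft prefix, so each step yields between 1 and k + 1 new tokens.
    """
    # Draft phase: cheap sequential lookahead.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verify phase: on real hardware all k positions are scored in one
    # target forward pass; here we check them one by one for clarity.
    ctx = list(prefix)
    for tok in draft:
        if target_next(ctx) == tok:
            ctx.append(tok)          # accept draft token
        else:
            break                    # first mismatch ends acceptance
    ctx.append(target_next(ctx))     # target's own next token
    return ctx

if __name__ == "__main__":
    seq = [0]
    while len(seq) < 32:
        seq = speculative_step(seq)
    print(seq)
```

The per-step gain is capped by the draft phase's sequential cost, which is the latency-acceptance tension the abstract refers to: longer drafts accept more tokens per verification but spend more wall-clock time drafting.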