Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution. Our method supports existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.
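To make the "novel sampling method" concrete, here is a minimal sketch of a single speculative-decoding step, assuming the draft (approximation) and target models are exposed as callables that return a next-token distribution for a given prefix. All names here (draft_probs, target_probs, gamma, speculative_step) are illustrative and not taken from the paper's T5X implementation; in a real system the gamma+1 target-model evaluations happen in one parallel forward pass rather than the loop shown here.

import numpy as np

def speculative_step(prefix, draft_probs, target_probs, gamma, rng):
    """One speculative step: propose gamma draft tokens, then accept/reject
    them so the accepted output is distributed exactly as the target model."""
    # 1) Draft model proposes gamma tokens autoregressively (cheap, serial).
    drafted, q = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        qi = draft_probs(ctx)                      # q(. | ctx) from the small model
        tok = int(rng.choice(len(qi), p=qi))
        drafted.append(tok)
        q.append(qi)
        ctx.append(tok)

    # 2) Target model scores all gamma+1 prefixes (done in parallel in practice).
    p = [target_probs(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3) Accept drafted token i with probability min(1, p_i(x) / q_i(x)).
    out = list(prefix)
    for i, tok in enumerate(drafted):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)
        else:
            # Rejected: resample from the residual distribution norm(max(0, p - q)),
            # which keeps the overall output distribution identical to the target model.
            residual = np.maximum(p[i] - q[i], 0.0)
            residual = residual / residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out

    # All gamma drafts accepted: take one bonus token from the last target distribution.
    out.append(int(rng.choice(len(p[gamma]), p=p[gamma])))
    return out

# Toy usage with context-free dummy models over a 4-token vocabulary (illustrative only).
rng = np.random.default_rng(0)
draft  = lambda ctx: np.array([0.4, 0.3, 0.2, 0.1])
target = lambda ctx: np.array([0.25, 0.25, 0.25, 0.25])
print(speculative_step([2], draft, target, gamma=3, rng=rng))

Each call either appends between 1 and gamma+1 tokens; when the draft model approximates the target well, most drafts are accepted, which is the source of the reported 2X-3X speedup with unchanged outputs.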