Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution. Our method supports existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.
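To make the "novel sampling method" concrete, here is a minimal sketch of a single speculative-decoding step, assuming the draft (approximation) and target models are exposed as callables that return a next-token distribution for a given prefix. All names here (draft_probs, target_probs, gamma, speculative_step) are illustrative and not taken from the paper's T5X implementation; in a real system the gamma+1 target-model evaluations happen in one parallel forward pass rather than the loop shown here.

import numpy as np

def speculative_step(prefix, draft_probs, target_probs, gamma, rng):
    """One speculative step: propose gamma draft tokens, then accept/reject
    them so the accepted output is distributed exactly as the target model."""
    # 1) Draft model proposes gamma tokens autoregressively (cheap, serial).
    drafted, q = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        qi = draft_probs(ctx)                      # q(. | ctx) from the small model
        tok = int(rng.choice(len(qi), p=qi))
        drafted.append(tok)
        q.append(qi)
        ctx.append(tok)

    # 2) Target model scores all gamma+1 prefixes (done in parallel in practice).
    p = [target_probs(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3) Accept drafted token i with probability min(1, p_i(x) / q_i(x)).
    out = list(prefix)
    for i, tok in enumerate(drafted):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)
        else:
            # Rejected: resample from the residual distribution norm(max(0, p - q)),
            # which keeps the overall output distribution identical to the target model.
            residual = np.maximum(p[i] - q[i], 0.0)
            residual = residual / residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out

    # All gamma drafts accepted: take one bonus token from the last target distribution.
    out.append(int(rng.choice(len(p[gamma]), p=p[gamma])))
    return out

# Toy usage with context-free dummy models over a 4-token vocabulary (illustrative only).
rng = np.random.default_rng(0)
draft  = lambda ctx: np.array([0.4, 0.3, 0.2, 0.1])
target = lambda ctx: np.array([0.25, 0.25, 0.25, 0.25])
print(speculative_step([2], draft, target, gamma=3, rng=rng))

Each call either appends between 1 and gamma+1 tokens; when the draft model approximates the target well, most drafts are accepted, which is the source of the reported 2X-3X speedup with unchanged outputs.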