We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.
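The abstract outlines the procedure at a high level: a cheap draft model proposes a short continuation, the target model scores all proposed positions in a single parallel call, and a modified rejection step keeps only as many tokens as are consistent with the target distribution. The sketch below illustrates one such step under stated assumptions; `draft_probs` and `target_probs` are hypothetical stand-ins for the two models' next-token distributions (the target call is assumed to return distributions for every draft position at once), and the acceptance/residual-resampling rule follows the description of the modified rejection scheme. It is an illustrative sketch, not the paper's reference implementation.

```python
# Minimal sketch of one speculative sampling step, assuming:
#   draft_probs(tokens)  -> next-token distribution from the small draft model
#   target_probs(tokens) -> (K+1, vocab) next-token distributions from the
#                           large target model, scored in one parallel call
import numpy as np

def speculative_sampling_step(prefix, draft_probs, target_probs, K, rng):
    # 1) Draft model proposes K tokens autoregressively (K cheap calls).
    drafted, q = list(prefix), []
    for _ in range(K):
        q_t = draft_probs(drafted)          # draft distribution at this position
        x_t = rng.choice(len(q_t), p=q_t)
        q.append(q_t)
        drafted.append(x_t)

    # 2) A single parallel target call scores all K draft positions
    #    (plus one extra position) at comparable latency to one target sample.
    p = target_probs(drafted)               # shape (K+1, vocab)

    # 3) Modified rejection sampling: accept draft token x_t with probability
    #    min(1, p_t(x_t) / q_t(x_t)); on rejection, resample from the
    #    normalised residual max(0, p_t - q_t) and stop.
    accepted = list(prefix)
    for t in range(K):
        x_t = drafted[len(prefix) + t]
        if rng.random() < min(1.0, p[t][x_t] / q[t][x_t]):
            accepted.append(x_t)
        else:
            residual = np.maximum(p[t] - q[t], 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(residual), p=residual))
            return accepted

    # All K drafts accepted: sample one bonus token from the target model.
    accepted.append(rng.choice(p.shape[1], p=p[K]))
    return accepted
```

Because between one and K+1 tokens are emitted per target call, and the accept/resample rule recovers the target distribution exactly, the speedup comes purely from amortising target-model latency over multiple tokens.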