Transformer-based models have made tremendous impacts in natural language generation. However, the inference speed is a bottleneck due to the large model size and the intensive computing involved in the auto-regressive decoding process. We develop the FastSeq framework to accelerate sequence generation without accuracy loss. The proposed optimization techniques include an attention cache optimization, an efficient algorithm for detecting repeated n-grams, and an asynchronous generation pipeline with parallel I/O. These optimizations are general enough to be applicable to Transformer-based models (e.g., T5, GPT2, and UniLM). Our benchmark results on a set of widely used and diverse models demonstrate a 4-9x inference speed gain. Additionally, FastSeq is easy to use with a simple one-line code change. The source code is available at https://github.com/microsoft/fastseq.
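The abstract does not show the one-line change itself; the sketch below assumes the usage pattern documented in the FastSeq repository, where importing fastseq before the underlying generation library patches the decoding path. The BART model name and generation parameters are illustrative, not from the paper.

```python
# Minimal usage sketch (assumed from the FastSeq repository README, not the paper):
# the one-line change is importing fastseq before the generation library, which
# applies the attention-cache and n-gram detection optimizations transparently.
import fastseq  # the one-line change: enables FastSeq optimizations
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

# Load a standard summarization model; generate() below runs through the
# FastSeq-optimized decoding path with no further code changes.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

inputs = tokenizer("FastSeq accelerates sequence generation.", return_tensors="pt")
with torch.no_grad():
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=50)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```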