Ultra-large-scale pre-trained models can effectively improve performance on a variety of tasks, but they also impose a heavy computational burden on inference. This paper introduces a series of optimization methods for ultra-large-scale pre-trained models that combine algorithmic characteristics with the hardware characteristics of GPU processors, and on this basis proposes an inference engine -- Easy and Efficient Transformer (EET) -- which delivers a significant performance improvement over existing solutions. We first introduce a pre-padding decoding mechanism that improves token parallelism for generation tasks. We then design highly optimized kernels that remove sequence masks and make the computation for padding tokens cost-free, while also supporting long sequences and large embedding sizes. Third, we introduce a user-friendly inference system with an easy service pipeline that greatly reduces the difficulty of engineering deployment while delivering high throughput. Compared with Faster Transformer's implementation of GPT-2 on A100, EET achieves a state-of-the-art speedup of 1.5-15x depending on the context length. EET is available at https://github.com/NetEase-FuXi/EET.
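To make the pre-padding idea concrete, the following is a minimal Python sketch (not the EET implementation; `PAD_ID` and `pre_pad` are hypothetical names) showing how variable-length prompts can be padded on the left so that generated tokens stay aligned across a batch during incremental decoding.

```python
# Minimal sketch of pre-padding, assuming a pad-token id of 0.
# Prompts are padded on the LEFT so every sequence in the batch ends at
# the same position; step t of generation then writes to the same column
# for every row, keeping token parallelism high for generation tasks.
from typing import List

PAD_ID = 0  # hypothetical pad-token id

def pre_pad(batch: List[List[int]]) -> List[List[int]]:
    """Left-pad every prompt to the length of the longest prompt."""
    max_len = max(len(seq) for seq in batch)
    return [[PAD_ID] * (max_len - len(seq)) + seq for seq in batch]

# Example: two prompts of different lengths.
prompts = [[11, 12, 13, 14], [21, 22]]
padded = pre_pad(prompts)
# padded == [[11, 12, 13, 14], [0, 0, 21, 22]]
```

Because the padding sits in front of the prompt rather than after it, newly generated tokens are appended at identical positions for all sequences, which is what allows the decoding step to be batched without per-sequence bookkeeping.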