Large transformer models have demonstrated promising performance on a wide range of natural language processing (NLP) tasks. Although the AI community has scaled models to the trillion-parameter level, the practical deployment of 10-100 billion parameter models remains difficult due to latency, throughput, and memory constraints. In this paper, we propose EnergonAI to address the challenges of efficiently deploying 10-100 billion parameter transformer models on single- or multi-GPU systems. EnergonAI adopts a hierarchy-controller system architecture to coordinate multiple devices and efficiently support different parallel patterns. It delegates the execution of sub-models to multiple workers in a single-controller style and applies tensor parallelism and pipeline parallelism among the workers in a multi-controller style. On top of this architecture, we propose three techniques: non-blocking pipeline parallelism, distributed redundant computation elimination, and peer memory pooling. EnergonAI enables users to write complex parallel code as if it were serial. Compared with FasterTransformer, EnergonAI demonstrates superior latency and throughput. In our experiments, EnergonAI achieves a 37% latency reduction with tensor parallelism and a 10% scalability improvement with pipeline parallelism, and it enlarges the model scale that can be inferred on a single GPU by exploiting a larger heterogeneous memory space at the cost of a limited performance penalty.
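To make the hierarchy-controller idea concrete, the following is a minimal, hypothetical Python sketch, not the EnergonAI API: the `Controller`, `Worker`, and `run_shard` names are invented for illustration. A single controller receives a serial-looking call and dispatches the work to several workers (single-controller style), each of which runs its own shard of the model before the partial results are combined (the workers' peer-to-peer coordination, which EnergonAI handles in a multi-controller style, is reduced here to a simple reduction step).

```python
# Hypothetical sketch of the hierarchy-controller pattern; not EnergonAI code.
from concurrent.futures import ThreadPoolExecutor
from typing import List


class Worker:
    """Owns one shard of the model (e.g. one tensor-parallel slice)."""

    def __init__(self, rank: int, num_workers: int):
        self.rank = rank
        self.num_workers = num_workers

    def run_shard(self, x: List[float]) -> List[float]:
        # Stand-in for executing this worker's sub-model on its device.
        return [v * (self.rank + 1) for v in x]


class Controller:
    """Single controller: user code calls it as if the model were serial."""

    def __init__(self, num_workers: int):
        self.workers = [Worker(r, num_workers) for r in range(num_workers)]
        self.pool = ThreadPoolExecutor(max_workers=num_workers)

    def forward(self, x: List[float]) -> List[float]:
        # Dispatch the input to every worker in parallel, then reduce the
        # partial outputs; all parallelism is hidden behind this method.
        futures = [self.pool.submit(w.run_shard, x) for w in self.workers]
        partials = [f.result() for f in futures]
        return [sum(vals) for vals in zip(*partials)]


if __name__ == "__main__":
    model = Controller(num_workers=4)
    # Serial-looking call, parallel execution underneath.
    print(model.forward([1.0, 2.0, 3.0]))
```

The point of the sketch is only the division of roles: the user-facing call site stays serial, the controller handles dispatch, and the workers execute their sub-models concurrently.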