The transformer is the most critical algorithmic innovation in Natural Language Processing (NLP) in recent years. Unlike RNN models, transformers process the sequence-length dimension in parallel, which leads to better accuracy on long sequences. However, efficiently deploying them for online services in GPU-equipped data centers is not easy. First, the heavier computation introduced by transformer structures makes it more challenging to meet the latency and throughput constraints of serving. Second, NLP tasks take sentences of variable length as input, and this variability of input dimensions poses a severe problem for efficient memory management and serving optimization. To address these challenges, this paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework. Three innovative features distinguish it from similar works. An efficient parallel algorithm is proposed for GPU-based batch reduction operations, such as Softmax and LayerNorm, which are the major hot spots besides BLAS routines. A memory allocation algorithm that better balances memory footprint against allocation/free efficiency is designed for variable-length input situations. A serving framework equipped with a new batching scheduler based on dynamic programming achieves optimal throughput on variable-length requests. The system achieves state-of-the-art transformer model serving performance on GPU platforms and can be seamlessly integrated into existing PyTorch code with a few lines of code.
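The abstract names Softmax and LayerNorm as the main non-BLAS hot spots; both are batch reduction operations, meaning each row of the batch needs one or two reductions (max, sum, mean, variance) followed by an elementwise pass. The following NumPy sketch shows that reduction structure only; it is not the paper's GPU kernel, whose contribution is running these per-row reductions efficiently in parallel on the device:

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable softmax over the last axis.
    Each row needs two reductions (max, then sum) plus an elementwise
    pass; the GPU kernel's job is to execute these per-row reductions
    in parallel across the whole batch."""
    m = x.max(axis=-1, keepdims=True)            # reduction 1: row max
    e = np.exp(x - m)                            # elementwise exp
    return e / e.sum(axis=-1, keepdims=True)     # reduction 2: row sum

def layernorm_rows(x, gamma, beta, eps=1e-5):
    """LayerNorm has the same shape of work: two reductions
    (mean, variance), then an elementwise normalize-scale-shift."""
    mu = x.mean(axis=-1, keepdims=True)          # reduction 1: row mean
    var = x.var(axis=-1, keepdims=True)          # reduction 2: row variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```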
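With variable-length inputs, activation tensor sizes change from request to request, so a naive malloc/free per tensor is slow while a single worst-case arena wastes memory. The sketch below is a generic caching free-list that illustrates the footprint-versus-speed trade-off the abstract describes; it is a minimal illustration, not the paper's actual allocation algorithm:

```python
import bisect

class CachedAllocator:
    """Minimal sketch of a caching allocator: freed chunks are kept in a
    size-sorted free list and reused best-fit, trading some extra
    footprint for cheap allocation under varying tensor sizes."""

    def __init__(self):
        self.free_chunks = []  # sorted list of (size, buffer) pairs

    def allocate(self, size):
        # Best fit: the first cached chunk whose size is >= the request.
        i = bisect.bisect_left(self.free_chunks, (size,))
        if i < len(self.free_chunks):
            return self.free_chunks.pop(i)[1]  # reuse (may be oversized)
        return bytearray(size)                 # fall back to a fresh buffer

    def free(self, buf):
        bisect.insort(self.free_chunks, (len(buf), buf))

alloc = CachedAllocator()
a = alloc.allocate(1024)
alloc.free(a)
b = alloc.allocate(512)  # reuses the cached 1024-byte chunk, no new alloc
```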
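The batching scheduler must trade larger batches (better GPU utilization) against the padding wasted when short and long requests share a batch. Below is a minimal sketch of one way to pose this as a dynamic program, under the simplifying assumption that cost is the total padded token count; the paper's actual objective, which targets throughput on variable-length requests, is richer, and `max_batch` is a hypothetical batch-size cap:

```python
def dp_batching(lengths, max_batch=32):
    """Return batches of sorted request lengths that minimize the total
    padded token count, via dynamic programming over split points."""
    lengths = sorted(lengths)
    n = len(lengths)
    INF = float("inf")
    cost = [INF] * (n + 1)   # cost[i]: best total cost for the first i requests
    split = [0] * (n + 1)    # split[i]: start index of the last batch
    cost[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_batch), i):
            # Batch requests j..i-1; all pad to lengths[i-1], the batch max.
            c = cost[j] + (i - j) * lengths[i - 1]
            if c < cost[i]:
                cost[i], split[i] = c, j
    # Recover the batches by walking the split points backward.
    batches, i = [], n
    while i > 0:
        j = split[i]
        batches.append(lengths[j:i])
        i = j
    return batches[::-1]

# Short and long requests end up in separate batches: [[5, 7, 8], [30, 32]]
print(dp_batching([30, 7, 32, 5, 8], max_batch=4))
```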
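The abstract claims integration into PyTorch code with a few lines. The usage sketch below follows the conversion style of the open-source release; the module name `turbo_transformers`, the `BertModel.from_torch` call, and the checkpoint `"bert-base-uncased"` are assumptions about that release and may differ between versions:

```python
import torch
import transformers
import turbo_transformers  # assumed import name of the open-source release

# Load a standard Hugging Face BERT checkpoint.
torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()

# Assumed from_torch-style conversion: reuse the trained weights,
# swap in the TurboTransformers runtime for inference.
tt_model = turbo_transformers.BertModel.from_torch(torch_model)

ids = torch.tensor([[101, 7592, 2088, 102]])  # a short tokenized input
with torch.no_grad():
    output = tt_model(ids)
```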