The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: model sizes vary drastically, with the largest reaching hundreds of billions of parameters; model characteristics differ due to the sparsity introduced by Mixture-of-Experts (MoE); target application scenarios can be latency-critical or throughput-oriented; and deployment hardware ranges from single- to multi-GPU systems with different types of memory and storage. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging. In this paper, we present DeepSpeed Inference, a comprehensive system solution for transformer model inference that addresses the above challenges. DeepSpeed Inference consists of (1) a multi-GPU inference solution that minimizes latency while maximizing the throughput of both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to GPU memory and compute to enable high inference throughput for large models that do not fit in aggregate GPU memory. DeepSpeed Inference reduces latency by up to 7.3x over the state of the art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios. Moreover, it enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can inference models 25x larger than GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over $50\%$ of A6000 peak).
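As a concrete illustration of the multi-GPU path, the sketch below shows how the released deepspeed Python package is typically invoked to wrap a Hugging Face model with tensor parallelism and fused inference kernels. This is a minimal usage sketch, not an API prescribed by this abstract: it assumes the deepspeed and transformers packages and a CUDA GPU, and the parameter names (mp_size, replace_with_kernel_inject) reflect the library around the time of publication and may differ in later releases.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Wrap the model in the DeepSpeed inference engine. mp_size sets the
# tensor-parallelism degree across GPUs; replace_with_kernel_inject swaps
# transformer layers for DeepSpeed's optimized fused inference kernels.
# (Parameter names are version-dependent; mp_size was later renamed.)
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed Inference is", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Launched with the deepspeed CLI (e.g. `deepspeed --num_gpus 2 script.py` with `mp_size=2`), the same script partitions the model across GPUs; the heterogeneous CPU/NVMe offload path is configured separately and is not shown here.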