Full Stack Optimization of Transformer Inference: a Survey

Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami

Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements compared to previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with minimal performance degradation for Transformer inference.
