Transformers have been considered among the most important deep learning models since 2018, in part because they have established state-of-the-art (SOTA) records and could potentially replace existing deep neural networks (DNNs). Despite these remarkable achievements, the prolonged turnaround time of Transformer models is a widely recognized roadblock. The variety of sequence lengths imposes additional computing overhead: inputs must be zero-padded to the maximum sentence length in the batch to accommodate parallel computing platforms. This paper targets the field-programmable gate array (FPGA) and proposes a coherent sequence-length-adaptive algorithm-hardware co-design for Transformer acceleration. In particular, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator reduces the complexity of attention-based models to linear and alleviates off-chip memory traffic. The proposed length-aware hardware resource scheduling algorithm dynamically allocates hardware resources to fill pipeline slots and eliminate bubbles for NLP tasks. Experiments show that our design incurs very small accuracy loss and achieves 80.2$\times$ and 2.6$\times$ speedups over CPU and GPU implementations, respectively, as well as 4$\times$ higher energy efficiency than a state-of-the-art GPU accelerator optimized via cuBLAS GEMM.
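To make the linear-complexity claim concrete, the sketch below shows one common family of sparse attention: windowed (local) attention, where each query attends only to keys within a fixed-radius window, so total work grows linearly in sequence length rather than quadratically. This is an illustrative stand-in written for this abstract, not the paper's actual hardware operator; the function name `local_sparse_attention` and the `window` parameter are assumptions for the example.

```python
import numpy as np

def local_sparse_attention(Q, K, V, window=4):
    """Windowed (local) sparse attention sketch.

    Each query position i attends only to key/value positions in
    [i - window, i + window], so the cost is O(n * window * d)
    instead of the O(n^2 * d) of dense attention. This is a generic
    illustration, not the operator proposed in the paper.
    """
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Scaled dot-product scores restricted to the local window.
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)
        # Numerically stable softmax over the window.
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi]
    return out
```

Because every query touches at most `2 * window + 1` keys, both the compute and the working set per query are constant, which is also why such patterns reduce off-chip memory traffic on bandwidth-bound accelerators.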