Self-attention-based transformer models have achieved tremendous success in the domain of natural language processing. Despite their efficacy, accelerating the transformer is challenging due to its quadratic computational complexity and large activation sizes. Existing transformer accelerators attempt to prune tokens to reduce memory access, albeit with high compute overheads. Moreover, previous works directly operate on the large matrices involved in the attention operation, which limits hardware utilization. To address these challenges, this work proposes a novel dynamic inference scheme, DynaTran, which prunes activations at runtime with low overhead, substantially reducing the number of ineffectual operations and thereby improving the throughput of transformer inference. We further propose tiling the matrices involved in transformer operations, along with diverse dataflows, to improve data reuse and enable higher energy efficiency. To implement these methods effectively, we propose AccelTran, a novel accelerator architecture for transformers. Extensive experiments with different models and benchmarks demonstrate that DynaTran achieves higher accuracy than the state-of-the-art top-k hardware-aware pruning strategy while attaining up to 1.2$\times$ higher sparsity. One of our proposed accelerators, AccelTran-Edge, achieves 330K$\times$ higher throughput with 93K$\times$ lower energy requirement compared to a Raspberry Pi device. In addition, AccelTran-Server achieves 5.73$\times$ higher throughput and 3.69$\times$ lower energy consumption compared to the state-of-the-art transformer co-processor, Energon. The simulation source code is available at https://github.com/jha-lab/acceltran.
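For intuition, the runtime activation pruning described above can be pictured as a magnitude-threshold mask applied to intermediate activations (e.g., attention scores) before the corresponding multiply-accumulate operations, so that zeroed entries become ineffectual operations the hardware can skip. The sketch below is a minimal NumPy illustration under that assumption; the function name, threshold value, and interface are hypothetical and not the paper's exact API.

```python
import numpy as np

def prune_activations(activations: np.ndarray, tau: float):
    """Zero out activation entries whose magnitude falls below tau.

    A minimal sketch of threshold-based runtime pruning in the spirit of
    DynaTran; tau and this interface are illustrative assumptions.
    Returns the pruned tensor and the induced sparsity fraction.
    """
    mask = np.abs(activations) >= tau          # keep only large-magnitude entries
    sparsity = 1.0 - float(mask.mean())        # fraction of entries pruned
    return activations * mask, sparsity

# Example: prune a random attention-score matrix at inference time.
scores = np.random.randn(128, 128).astype(np.float32)
pruned_scores, sparsity = prune_activations(scores, tau=0.5)
print(f"induced sparsity: {sparsity:.2%}")
```

Because the mask is computed with a single elementwise comparison, the pruning decision itself adds little compute relative to the matrix multiplications it allows the accelerator to skip.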