Transformers, composed of QKV generation, attention computation, and FFNs, have become the dominant model architecture across various domains due to their outstanding performance. However, their high computational cost hinders efficient hardware deployment. Sparsity offers a promising solution, yet most existing accelerators exploit only intra-row sparsity in attention, and few consider inter-row sparsity. Approaches that do leverage inter-row sparsity often rely on costly global similarity estimation, which erodes the acceleration benefits of sparsity, and typically apply sparsity to only one or two Transformer components. Through careful analysis of the attention distribution and computation flow, we observe that local similarity enables end-to-end sparse acceleration at a much lower computational overhead than global estimation. Motivated by this observation, we propose ESACT, an end-to-end sparse accelerator for compute-intensive Transformers. ESACT centers on the Sparsity Prediction with Local Similarity (SPLS) mechanism, which leverages HLog quantization to accurately predict local attention sparsity prior to QK generation, achieving efficient sparsity across all Transformer components. To support efficient hardware realization, we introduce three architectural innovations. Experimental results on 26 benchmarks demonstrate that SPLS reduces total computation by 52.03% with less than 1% accuracy loss. ESACT achieves an end-to-end energy efficiency of 3.29 TOPS/W and improves attention-level energy efficiency by 2.95x and 2.26x over the state-of-the-art attention accelerators SpAtten and Sanger, respectively.
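The SPLS mechanism itself is described later in the paper; as a rough, self-contained illustration of the local-similarity idea the abstract refers to, the NumPy sketch below estimates attention scores once per block of adjacent query rows and shares the resulting sparsity mask across the block. The block size, keep ratio, and mean-query proxy are assumptions made for illustration only and do not reproduce ESACT's actual SPLS prediction, its pre-QK scheduling, or HLog quantization.

```python
# Illustrative sketch only: local-similarity-based attention sparsity
# prediction. Adjacent query rows tend to attend to similar keys, so one
# cheap score estimate per block can select the keys kept for the whole
# block. Parameters here are hypothetical, not ESACT's configuration.
import numpy as np

def predict_local_sparsity(Q, K, block_size=8, keep_ratio=0.25):
    """Estimate scores once per block (mean-query proxy), keep the top
    keys, and share that sparsity mask across all rows of the block."""
    n, d = Q.shape
    mask = np.zeros((n, K.shape[0]), dtype=bool)
    n_keep = max(1, int(keep_ratio * K.shape[0]))
    for start in range(0, n, block_size):
        block = Q[start:start + block_size]
        # Cheap proxy: one score vector per block instead of one per row.
        proxy_scores = block.mean(axis=0) @ K.T
        keep = np.argpartition(proxy_scores, -n_keep)[-n_keep:]
        mask[start:start + block_size, keep] = True
    return mask

def sparse_attention(Q, K, V, mask):
    """Full-precision attention restricted to the predicted key set."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores = np.where(mask, scores, -np.inf)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
out = sparse_attention(Q, K, V, predict_local_sparsity(Q, K))
```

In this toy setting the mask is computed from full-precision queries and keys; ESACT instead predicts the mask before QK generation using quantized operands, which is what allows sparsity to propagate to QKV generation and the FFNs as well.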