Transformers are at the core of modern AI. They rely heavily on matrix multiplication and require efficient acceleration due to their substantial memory and computational demands. Quantization plays a vital role in reducing memory usage and can also be exploited for computation by designing reconfigurable architectures that accelerate matrix multiplication through dynamic precision adjustment. This paper proposes ADiP, a novel adaptive-precision systolic array architecture for efficient matrix multiplication acceleration. The proposed architecture consists of N×N adaptive-precision processing elements (PEs) and shared accumulators. ADiP supports multiple computation modes, including symmetric single-matrix multiplication as well as asymmetric multi-matrix multiplication with a shared input matrix, thereby improving data reuse and PE utilization. In addition, ADiP maximizes computational density by adapting to different precisions, such as 8-bit×8-bit, 8-bit×4-bit, and 8-bit×2-bit. Analytical latency and throughput models are developed for the ADiP architecture across versatile configurations. A comprehensive hardware design-space exploration in a 22 nm commercial technology demonstrates up to 4× higher computational throughput. Furthermore, ADiP is evaluated on transformer workloads from the GPT-2 Medium, BERT Large, and BitNet-1.58B models, delivering a latency improvement of up to 53.6% and an energy improvement of up to 24.4% on BitNet-1.58B MHA workloads. At a 64×64 array size with 4096 PEs, ADiP achieves a peak throughput of 8.192 TOPS, 16.384 TOPS, and 32.768 TOPS for 8-bit×8-bit, 8-bit×4-bit, and 8-bit×2-bit operations, respectively.
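As a sanity check on the peak-throughput figures (counting each multiply-accumulate as two operations, and assuming a 1 GHz clock, which is the frequency implied by the reported numbers rather than one stated here), the peak throughput of an N×N array can be written as:

\[
T_{\text{peak}} = 2 \cdot N^2 \cdot f_{\text{clk}} \cdot k,
\]

where \(k\) is the precision-dependent parallelism factor (\(k = 1\) for 8-bit×8-bit, \(k = 2\) for 8-bit×4-bit, and \(k = 4\) for 8-bit×2-bit). With \(N = 64\) and \(f_{\text{clk}} = 1\,\text{GHz}\), this yields \(2 \cdot 4096 \cdot 10^{9} = 8.192\) TOPS for 8-bit×8-bit, and 16.384 TOPS and 32.768 TOPS for the two lower-precision modes, matching the reported values.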