The acceleration of deep-learning kernels in hardware relies on matrix multiplications that are executed efficiently on Systolic Arrays (SAs). To effectively trade off deep-learning training/inference quality against hardware cost, SA accelerators employ reduced-precision Floating-Point (FP) arithmetic. In this work, we demonstrate the need for new pipeline organizations that reduce the latency and improve the energy efficiency of reduced-precision FP operators for the chained multiply-add operation imposed by the structure of the SA. The proposed skewed pipeline design reorganizes the pipelined operation of the FP multiply-add units and enables new forwarding paths for the exponent logic, which allow the pipeline stages of consecutive Processing Elements (PEs) to execute in parallel. As a result, the latency of matrix multiplication within the SA is significantly reduced at minimal hardware cost, yielding energy reductions of 8% and 11% for the examined state-of-the-art CNNs.
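To make the dependency that motivates this work concrete, the following is a minimal behavioral sketch (not the authors' design or RTL) of the chained multiply-add along one column of a weight-stationary systolic array: each PE adds its product to the partial sum received from the previous PE, so consecutive PEs form a serial accumulation chain. The function name `column_mac` and the data values are illustrative assumptions.

```python
def column_mac(weights, activations):
    """Propagate a partial sum through a column of PEs.

    weights[i] is the stationary weight held by PE i; activations[i] is the
    activation streamed into PE i for the same output.
    """
    psum = 0.0
    for w, a in zip(weights, activations):
        # Each PE performs psum_out = psum_in + w * a: a chained multiply-add
        # whose result feeds the next PE in the column.
        psum = psum + w * a
    return psum


if __name__ == "__main__":
    # One output of a 4-element dot product, as a single SA column computes it.
    print(column_mac([0.5, -1.25, 2.0, 0.75], [1.0, 2.0, -0.5, 4.0]))
```

In a pipelined FP implementation of this chain, each PE's addition depends on the result produced by the PE before it; reorganizing the pipeline so that parts of consecutive PEs' operations overlap is what the skewed pipeline targets.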