Accelerating neural network inference with FPGAs has emerged as a popular option, since the reconfigurability and high-performance computing capability of FPGAs intrinsically satisfy the computational demands of fast-evolving neural algorithms. However, popular neural accelerators on FPGAs (e.g., the Xilinx DPU) mainly utilize DSP resources to construct their processing units, while the abundant LUT resources are not well exploited. In this work, via a software-hardware co-design approach, we develop an FPGA-based heterogeneous computing system for neural network acceleration. From the hardware perspective, the proposed accelerator consists of DSP- and LUT-based GEneral Matrix-Multiplication (GEMM) computing cores, which form the entire computing system in a heterogeneous fashion. The DSP- and LUT-based GEMM cores operate under a unified Instruction Set Architecture (ISA) with unified buffers. Along the data flow of the neural network inference path, the computation of each convolution/fully-connected layer is split into two portions, handled by the DSP- and LUT-based GEMM cores asynchronously. From the software perspective, we mathematically and systematically model the latency and resource utilization of the proposed heterogeneous accelerator with respect to varying system design configurations. By leveraging reinforcement learning, we construct a framework that achieves end-to-end selection and optimization of the design specification of the target heterogeneous accelerator, including the workload-split strategy, the mixed-precision quantization scheme, and the resource allocation between the DSP and LUT cores. By virtue of the proposed design framework and heterogeneous computing system, our design outperforms the state-of-the-art Mix&Match design, reducing latency by 1.12-1.32x while achieving higher inference accuracy. N3H-Core is open-sourced at: https://github.com/elliothe/N3H_Core.
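To make the asynchronous workload split concrete, the following is a minimal Python sketch (not the authors' code) of how a layer's GEMM workload might be divided between a DSP-based and a LUT-based core so that both finish at roughly the same time. The function name `balanced_split` and the per-cycle throughput figures are illustrative assumptions, not values from the paper or the N3H_Core repository.

```python
def balanced_split(total_macs, dsp_macs_per_cycle, lut_macs_per_cycle):
    """Return the fraction of MACs assigned to the DSP core such that
    t_dsp ~= t_lut when both cores run asynchronously in parallel.

    With t_dsp = r * total / dsp_rate and t_lut = (1 - r) * total / lut_rate,
    setting t_dsp == t_lut gives r = dsp_rate / (dsp_rate + lut_rate).
    """
    r = dsp_macs_per_cycle / (dsp_macs_per_cycle + lut_macs_per_cycle)
    t_dsp = r * total_macs / dsp_macs_per_cycle
    t_lut = (1 - r) * total_macs / lut_macs_per_cycle
    # Overall layer latency is bounded by the slower of the two cores.
    return r, max(t_dsp, t_lut)


if __name__ == "__main__":
    # Example: a convolution layer flattened to GEMM (~115M MACs), with
    # assumed per-cycle MAC throughputs for the DSP and LUT cores.
    ratio, cycles = balanced_split(total_macs=115_605_504,
                                   dsp_macs_per_cycle=1024,
                                   lut_macs_per_cycle=512)
    print(f"DSP share: {ratio:.2%}, latency: {cycles:,.0f} cycles")
```

In the actual framework described above, this split ratio is not fixed analytically per layer but is selected per-layer by the reinforcement-learning agent jointly with the quantization bit-widths and the DSP/LUT resource allocation.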