The combination of Winograd's algorithm and the systolic array architecture has been shown to improve DSP efficiency when accelerating convolutional neural networks (CNNs) on FPGA platforms. However, supporting arbitrary convolution kernel sizes in FPGA-based Winograd processing elements and providing efficient data access remain underexplored. In this work, we are the first to propose an optimized Winograd processing element (WinoPE) that naturally supports multiple convolution kernel sizes with the same amount of computing resources while maintaining high runtime DSP efficiency. Using the proposed WinoPE, we construct a highly efficient systolic array accelerator, termed WinoCNN. We also propose a dedicated memory subsystem to optimize data access. Based on the accelerator architecture, we build accurate resource and performance models to explore optimal accelerator configurations under different resource constraints. We implement the proposed accelerator on multiple FPGAs, and it outperforms state-of-the-art designs in both throughput and DSP efficiency. Our implementation achieves a DSP efficiency of up to 1.33 GOPS/DSP and a throughput of up to 3.1 TOPS on the Xilinx ZCU102 FPGA, which are 29.1\% and 20.0\% better, respectively, than the best previously reported solutions.
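For reference, the Winograd minimal filtering algorithm underlying the WinoPE computes a tile of convolution outputs as a transform-domain elementwise product. The standard 1-D $F(2,3)$ instance below is taken from the general Winograd minimal-filtering literature, not from this paper's specific WinoPE design, and is shown only to illustrate the transform structure the abstract refers to:
\[
  Y = A^{T}\bigl[(G\,g) \odot (B^{T}\,d)\bigr],\qquad
  B^{T} =
  \begin{bmatrix}
    1 & 0 & -1 & 0\\
    0 & 1 & 1 & 0\\
    0 & -1 & 1 & 0\\
    0 & 1 & 0 & -1
  \end{bmatrix},\quad
  G =
  \begin{bmatrix}
    1 & 0 & 0\\
    \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2}\\
    \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2}\\
    0 & 0 & 1
  \end{bmatrix},\quad
  A^{T} =
  \begin{bmatrix}
    1 & 1 & 1 & 0\\
    0 & 1 & -1 & -1
  \end{bmatrix},
\]
where $d$ is a 4-element input tile, $g$ is a 3-tap filter, and $\odot$ denotes elementwise multiplication. Only four multiplications produce the two outputs that direct convolution would compute with six, which is the source of the DSP-efficiency gains discussed above.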