Single computation engines have become a popular design choice for FPGA-based convolutional neural networks (CNNs), enabling the deployment of diverse models without fabric reconfiguration. This flexibility, however, often comes with significantly reduced performance on memory-bound layers and resource underutilisation due to the suboptimal mapping of certain layers onto the engine's fixed configuration. In this work, we investigate the implications for CNN engine design of a class of models that introduce a pre-convolution stage to decompress the weights at run time. We refer to these approaches as on-the-fly. To minimise the negative impact of limited bandwidth on memory-bound layers, we present a novel hardware component that enables the on-chip on-the-fly generation of weights. We further introduce an input-selective processing element (PE) design that balances the load between PEs on suboptimally mapped layers. Finally, we present unzipFPGA, a framework to train on-the-fly models and traverse the design space to select the highest-performing CNN engine configuration. Quantitative evaluation shows that unzipFPGA yields average speedups of 2.14x and 71% over optimised status-quo and pruned CNN engines, respectively, under constrained bandwidth, and up to 3.69x higher performance density over state-of-the-art FPGA-based CNN accelerators.
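To make the on-the-fly idea concrete, the sketch below is a minimal software analogue, not the paper's actual hardware or compression scheme: it assumes a hypothetical codebook-based representation in which only small indices are stored off-chip, and dense weights for each tile are generated ("unzipped") immediately before that tile's convolution, so full weights never cross the memory boundary.

```python
# Illustrative sketch only; the codebook scheme, tile size, and naive
# convolution are assumptions for demonstration, not unzipFPGA's design.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical compressed representation: a K-entry codebook plus
# per-weight indices (what off-chip DRAM would actually store).
K = 16
codebook = rng.standard_normal(K).astype(np.float32)
out_ch, in_ch, kh, kw = 8, 4, 3, 3
indices = rng.integers(0, K, size=(out_ch, in_ch, kh, kw), dtype=np.uint8)

def generate_weights_tile(oc_start, oc_end):
    """On-the-fly stage: expand codebook indices into dense weights
    for one tile of output channels (done on-chip in hardware)."""
    return codebook[indices[oc_start:oc_end]]

def conv2d_tile(x, w):
    """Naive direct convolution for one output-channel tile."""
    oc, ic, kh_, kw_ = w.shape
    h, w_out = x.shape[1] - kh_ + 1, x.shape[2] - kw_ + 1
    y = np.zeros((oc, h, w_out), dtype=np.float32)
    for o in range(oc):
        for i in range(ic):
            for p in range(kh_):
                for q in range(kw_):
                    y[o] += w[o, i, p, q] * x[i, p:p + h, q:q + w_out]
    return y

x = rng.standard_normal((in_ch, 10, 10)).astype(np.float32)
tile = 4  # output channels per tile
outputs = []
for oc in range(0, out_ch, tile):
    w_tile = generate_weights_tile(oc, oc + tile)  # "unzip" on the fly
    outputs.append(conv2d_tile(x, w_tile))
y = np.concatenate(outputs, axis=0)
print(y.shape)  # (8, 8, 8)
```

In hardware, the weight-generation stage would run concurrently with the convolution of the previous tile, which is what lets it hide the cost of limited off-chip bandwidth on memory-bound layers.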