Acceleration of Convolutional Neural Networks (CNNs) on edge devices has recently achieved remarkable performance in image classification and object detection applications. This paper proposes an efficient and scalable CNN-based SoC-FPGA accelerator design that takes pre-trained weights quantized to 16-bit fixed point, together with the target hardware specification, and generates an optimized template capable of achieving a better performance-versus-resource-utilization trade-off. The template analyzes the computational workload, data dependencies, and external memory bandwidth, and applies loop tiling transformations along with dataflow modeling to convert convolutional and fully connected layers into vector multiplications between input and output feature maps, resulting in a single on-chip compute unit. Furthermore, the accelerator was evaluated on the AlexNet, VGG16, and LeNet networks and runs at 200 MHz with a peak performance of 230 GOP/s, depending on the ZYNQ board and on design-space exploration of different compute-unit configurations during simulation and synthesis. Lastly, the proposed methodology was benchmarked against a previous implementation on the Ultra96 board to demonstrate the performance improvement.
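To make the loop-tiling idea concrete, the following is a minimal C sketch of a tiled convolution layer. The tile sizes Tm/Tn, the layer dimensions, and all array names are illustrative assumptions, not the paper's actual template parameters; the template additionally buffers each tile on-chip, which is omitted here for brevity.

```c
/* Minimal sketch of loop-tiled convolution with 16-bit fixed-point data.
 * All dimensions and tile sizes below are assumed for illustration. */
#define M  64   /* output feature maps (assumed) */
#define N  64   /* input feature maps (assumed)  */
#define R  32   /* output rows (assumed)         */
#define C  32   /* output columns (assumed)      */
#define K  3    /* kernel size (assumed)         */
#define Tm 8    /* output-map tile size (assumed) */
#define Tn 8    /* input-map tile size (assumed)  */

typedef short fixed16_t; /* 16-bit fixed-point word */

/* Caller is expected to zero-initialize `out` before the call. */
void conv_tiled(const fixed16_t in[N][R + K - 1][C + K - 1],
                const fixed16_t w[M][N][K][K],
                int out[M][R][C])
{
    /* Outer tile loops: each (to, ti) pair selects a Tm x Tn block of
     * weights and feature maps small enough to buffer on-chip. */
    for (int to = 0; to < M; to += Tm)
        for (int ti = 0; ti < N; ti += Tn)
            /* Inner loops: the compute unit performs the
             * multiply-accumulate between input and output maps;
             * in hardware these would be unrolled and pipelined. */
            for (int row = 0; row < R; row++)
                for (int col = 0; col < C; col++)
                    for (int i = 0; i < K; i++)
                        for (int j = 0; j < K; j++)
                            for (int oo = to; oo < to + Tm; oo++)
                                for (int ii = ti; ii < ti + Tn; ii++)
                                    out[oo][row][col] +=
                                        (int)w[oo][ii][i][j] *
                                        (int)in[ii][row + i][col + j];
}
```

The tiling factors Tm and Tn bound the on-chip buffer size and the number of parallel multiply-accumulate units, which is the trade-off between performance and resource utilization that the template explores.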