Hardware accelerators for convolutional neural networks (CNNs) enable real-time applications of artificial intelligence technology. However, most existing designs suffer from low hardware utilization or high area cost due to complex dataflows. This paper proposes a hardware-efficient vectorwise CNN accelerator that adopts a systolic array optimized for 3$\times$3 filters, using a 1-D broadcast dataflow to generate partial sums. This enables easy reconfiguration for other kernel types through interleaved or elementwise input dataflows. The simple and regular dataflow results in low area cost while attaining high hardware utilization. The presented design achieves 99\%, 97\%, 93.7\%, and 94\% hardware utilization for VGG-16, ResNet-34, GoogLeNet, and MobileNet, respectively. The hardware implementation in TSMC 40nm technology requires a 266.9K NAND-gate count and 191KB of SRAM to deliver 168GOPS of throughput, and consumes only 154.98mW at a 500MHz operating frequency, offering superior area and power efficiency compared with other designs.
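To make the 1-D broadcast dataflow concrete, the following is a minimal functional sketch (not the authors' RTL): a single row of processing elements receives one broadcast input value per cycle, and each PE multiplies it by its locally stored weight from one 3-tap filter row and accumulates a partial sum for the output column it is responsible for. The function name \texttt{pe\_row\_1d\_broadcast} and the NumPy-based model are illustrative assumptions, not the paper's implementation.

\begin{verbatim}
import numpy as np

# Hypothetical functional model of a 1-D broadcast PE row: every cycle the
# same input value is broadcast to all PEs; each PE holds one weight of a
# 3-tap filter row and accumulates the partial sum of a different output.
def pe_row_1d_broadcast(input_row, filter_row):
    K = len(filter_row)                     # 3 for one row of a 3x3 kernel
    out_len = len(input_row) - K + 1        # 'valid' output width
    psum = np.zeros(out_len)                # partial sums held inside PEs
    for t, x in enumerate(input_row):       # one broadcast value per cycle
        for k, w in enumerate(filter_row):  # all PEs see the same x
            o = t - k                       # output column this PE updates
            if 0 <= o < out_len:
                psum[o] += w * x
    return psum

# Example: partial sums for one filter row. Summing the results of three
# such passes (one per input/filter row) gives the full 3x3 output row.
x = np.arange(8, dtype=float)
w = np.array([1.0, 0.0, -1.0])
print(pe_row_1d_broadcast(x, w))   # equals np.convolve(x, w[::-1], 'valid')
\end{verbatim}

In this sketch each output's reduction is spread over time inside one PE, so only the partial sums need local storage and the input bus is shared by all PEs, which is the property that keeps the dataflow simple and regular.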