Memory bandwidth has become the real-time bottleneck of current deep learning accelerators (DLAs), particularly for high-definition (HD) object detection. Under resource constraints, this paper proposes a low-memory-traffic DLA chip with joint hardware and software optimization. To maximize hardware utilization under limited memory bandwidth, we morph and fuse the object detection model into a group-fusion-ready model that reduces intermediate data access, cutting YOLOv2's feature memory traffic from 2.9 GB/s to 0.15 GB/s. To support group fusion, the hardware extends our previous DLA with a unified buffer with write-masking for simple layer-by-layer processing within a fusion group. Implemented in a TSMC 40 nm process, the chip supports 1280x720@30FPS object detection and, compared with our previous DLA with the same number of PEs, consumes 7.9X less external DRAM access energy (327.6 mJ vs. 2607 mJ).