Existing deep convolutional neural networks (CNNs) generate massive amounts of interlayer feature data during network inference. To maintain real-time processing in embedded systems, large on-chip memory is required to buffer the interlayer feature maps. In this paper, we propose an efficient hardware accelerator with an interlayer feature compression technique that significantly reduces the required on-chip memory size and off-chip memory access bandwidth. The accelerator compresses interlayer feature maps by transforming the stored data into the frequency domain using a hardware-implemented 8x8 discrete cosine transform (DCT). After the DCT, high-frequency components are removed through quantization, and sparse matrix compression is applied to further compress the interlayer feature maps. The on-chip memory allocation scheme is designed to support dynamic configuration of the feature map buffer size and scratchpad size according to the requirements of different network layers. The hardware accelerator combines compression, decompression, and CNN acceleration into one computing stream, achieving minimal compression and processing delay. A prototype accelerator is implemented on an FPGA platform and also synthesized in TSMC 28-nm CMOS technology. It achieves 403 GOPS peak throughput and 1.4x to 3.3x interlayer feature map reduction with only light hardware area overhead, making it a promising hardware accelerator for intelligent IoT devices.
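To make the compression pipeline concrete, the sketch below models the three stages described above in software: an 8x8 2-D DCT, removal of high-frequency coefficients via quantization, and sparse (index, value) encoding of the surviving coefficients. This is a minimal illustrative model, not the paper's hardware design; the cutoff threshold, quantization step `q`, and function names are assumptions chosen for clarity.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal n x n DCT-II basis matrix."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row scaling for orthonormality
    return c

def compress_tile(tile, cutoff=6, q=8.0):
    """Compress one 8x8 feature-map tile (illustrative parameters).

    Steps: 2-D DCT -> drop high-frequency coefficients -> quantize ->
    sparse (index, value) encoding of the nonzero coefficients.
    """
    C = dct_matrix(8)
    coeffs = C @ tile @ C.T                       # 2-D DCT (frequency domain)
    u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
    coeffs[u + v >= cutoff] = 0.0                 # remove high-frequency components
    qc = np.round(coeffs / q).astype(np.int16)    # uniform quantization
    idx = np.flatnonzero(qc)                      # sparse matrix compression:
    return idx.astype(np.uint8), qc.ravel()[idx]  # store only (index, value) pairs

def decompress_tile(idx, vals, q=8.0):
    """Inverse path: scatter sparse values, dequantize, inverse 2-D DCT."""
    C = dct_matrix(8)
    qc = np.zeros(64, dtype=np.float32)
    qc[idx] = vals
    return C.T @ (qc.reshape(8, 8) * q) @ C       # lossy reconstruction of the tile
```

Because feature maps (like natural images) concentrate energy in low frequencies, zeroing the high-frequency triangle followed by sparse encoding yields a compact representation; in hardware, the paper fuses this compress/decompress path into the accelerator's computing stream so it adds minimal latency.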