Deep neural networks (DNNs) have been successfully employed in a multitude of applications with remarkable performance. As such performance is achieved at a significant computational cost, several embedded applications demand fast and efficient hardware accelerators for DNNs. Previously proposed application-specific integrated circuit (ASIC) architectures strive to utilize arrays of hundreds of processing elements (PEs) and to reduce power-hungry DRAM accesses using multiple dataflows that require complex PE architectures. These complex PE architectures consume significant area and reduce the maximum clock frequency. This paper introduces the Kraken architecture, which optimally processes the convolutional layers, fully-connected layers, and matrix products of any DNN through a hardware-friendly uniform dataflow. This enables maximal data reuse of weights, inputs, and outputs, with a bare-bones PE design and on-the-fly dynamic reconfiguration. Kraken, implemented in 65-nm CMOS technology at 400 MHz, packs 672 PEs in 7.3 mm², with a peak performance of 537.6 Gops. Kraken processes the convolutional layers of AlexNet, VGG-16, and ResNet-50 at 336.6, 17.5, and 64.2 frames/s, respectively, hence outperforming the state-of-the-art ASIC architectures in terms of overall performance efficiency, DRAM accesses, arithmetic intensity, and throughput, with 5.8x more Gops/mm² and 1.6x more Gops/W.
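As a rough sanity check on the headline figures, the peak throughput follows directly from the PE count and clock rate if each PE is assumed to complete one multiply-accumulate (conventionally counted as two operations) per cycle. The short Python sketch below, based only on the numbers quoted above and that assumption, reproduces the 537.6 Gops peak and the implied area efficiency.

```python
# Back-of-the-envelope check of Kraken's quoted peak throughput.
# Assumption (not stated in the abstract): one MAC per PE per cycle,
# counted as 2 ops, which is the usual convention for such figures.
NUM_PES = 672          # processing elements on the die
CLOCK_HZ = 400e6       # 400 MHz implementation in 65-nm CMOS
OPS_PER_MAC = 2        # one multiply + one accumulate
DIE_AREA_MM2 = 7.3     # reported area

peak_gops = NUM_PES * CLOCK_HZ * OPS_PER_MAC / 1e9
area_efficiency = peak_gops / DIE_AREA_MM2

print(f"Peak throughput: {peak_gops:.1f} Gops")            # 537.6 Gops
print(f"Area efficiency: {area_efficiency:.1f} Gops/mm^2") # ~73.6 Gops/mm^2
```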