Sparse convolutional neural networks (CNNs) have gained significant traction over the past few years, since, when exploited properly, sparsity can drastically reduce model size and computation compared to dense counterparts. Sparse CNNs, however, often introduce variations in layer shapes and sizes, which prevent dense accelerators from performing well on sparse CNN models. Recently proposed sparse accelerators such as SCNN, Eyeriss v2, and SparTen actively exploit two-sided (full) sparsity, that is, sparsity in both weights and activations, for performance gains. These accelerators, however, either have an inefficient micro-architecture that limits their performance, lack support for non-unit stride convolutions and fully-connected (FC) layers, or suffer heavily from systematic load imbalance. To circumvent these issues and support both sparse and dense models, we propose Phantom, a multi-threaded, dynamic, and flexible neural computational core. Phantom uses a sparse binary mask representation to actively look ahead into sparse computations and dynamically schedules its computational threads to maximize thread utilization and throughput. We also build a two-dimensional (2D) mesh of Phantom neural computational cores, which we refer to as the Phantom-2D accelerator, and propose a novel dataflow that supports all layers of a CNN, including unit and non-unit stride convolutions and FC layers. In addition, Phantom-2D uses a two-level load-balancing strategy to minimize computational idling, further improving hardware utilization. To demonstrate support for different layer types, we evaluate the Phantom architecture on VGG16 and MobileNet. Our simulations show that the Phantom-2D accelerator attains performance gains of 12x, 4.1x, 1.98x, and 2.36x over dense architectures, SCNN, SparTen, and Eyeriss v2, respectively.
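To make the binary-mask lookahead idea concrete, the following is a minimal Python sketch, not the authors' hardware design: AND-ing the weight and activation masks exposes, ahead of time, exactly which multiply-accumulate (MAC) operations produce nonzero work, which can then be packed onto compute threads instead of letting them idle on zeros. The thread count and the round-robin assignment here are illustrative assumptions, not the Phantom scheduler itself.

```python
# Sketch of binary-mask-driven lookahead scheduling (illustrative only).
# w_mask / a_mask mark nonzero weights and activations; their AND gives
# the effective MACs, which are distributed across compute threads.
import numpy as np

def schedule_nonzero_work(w_mask, a_mask, num_threads=4):
    """Return per-thread lists of (row, col) positions that need a MAC."""
    effective = np.logical_and(w_mask, a_mask)   # lookahead: only these MACs matter
    work = list(zip(*np.nonzero(effective)))     # flatten mask hits into tasks
    threads = [[] for _ in range(num_threads)]
    for i, task in enumerate(work):              # round-robin keeps threads busy
        threads[i % num_threads].append(task)
    return threads

rng = np.random.default_rng(0)
w_mask = rng.random((8, 8)) < 0.3                # ~70% weight sparsity
a_mask = rng.random((8, 8)) < 0.5                # ~50% activation sparsity
for t, tasks in enumerate(schedule_nonzero_work(w_mask, a_mask)):
    print(f"thread {t}: {len(tasks)} MACs")
```

In this toy setting, only the mask intersection (roughly 15% of positions here) generates work, which is why skipping zero operands and balancing the surviving MACs across threads is where the throughput gain comes from.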