Massively parallel systolic arrays and resource-efficient depthwise separable convolutions are two promising techniques to accelerate DNN inference on the edge. Interestingly, their combination is inefficient: the computational patterns of depthwise separable convolutions do not exhibit a rhythmic systolic flow and lack sufficient data reuse to saturate systolic arrays. We formally analyse this inefficiency and, to alleviate it, propose an efficient operator, an optimal hardware dataflow, and a superior training methodology. The efficient operator, called FuSeConv, is a drop-in replacement for depthwise separable convolutions. FuSeConv factorizes convolutions fully along their spatial and depth dimensions, and the resultant computation maps efficiently to systolic arrays. The optimal dataflow, called Spatial-Tiled Output Stationary (ST-OS), maximizes the efficiency of FuSeConv on systolic arrays by mapping independent convolutions to rows of the array, maximizing resource utilization with negligible VLSI overheads. Neural Operator Scaffolding (NOS) scaffolds the training of FuSeConv by distilling knowledge from the expensive depthwise separable convolutions, bridging the accuracy gap between FuSeConv networks and their baselines. Additionally, NOS can be combined with Neural Architecture Search (NAS) to trade off latency and accuracy. The HW/SW co-design of FuSeConv with ST-OS achieves a significant speedup of 4.1-9.25X on state-of-the-art efficient networks for ImageNet. The parameter efficiency of FuSeConv and its significant outperformance of depthwise separable convolutions on systolic arrays illustrate its promise as a strong solution on the edge. Training FuSeConv networks with NOS achieves accuracy comparable to the baselines. Further, by combining NOS with NAS, we design networks that define the state of the art, improving on both accuracy and latency on systolic arrays.
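To make the factorization concrete, the sketch below shows one plausible FuSeConv-style block in PyTorch. It is an illustration under assumptions, not the paper's reference implementation: we assume the k x k depthwise convolution is replaced by parallel 1-D depthwise convolutions (a 1 x k filter on one half of the channels and a k x 1 filter on the other), followed by the usual 1 x 1 pointwise convolution; the class name FuSeBlock and the half-and-half channel split are hypothetical. Each 1-D filter is a single row or column of multiply-accumulates, which is the kind of independent computation ST-OS can map onto individual rows of a systolic array.

```python
import torch
import torch.nn as nn

class FuSeBlock(nn.Module):
    """Illustrative sketch of a fully separable (FuSeConv-style) block.

    Assumption (not stated in the abstract): the k x k depthwise conv is
    factorized into a horizontal 1 x k depthwise conv on half the channels
    and a vertical k x 1 depthwise conv on the rest, then a 1 x 1 pointwise
    conv mixes information across channels.
    """

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        half = in_ch // 2
        pad = k // 2
        # Horizontal 1-D depthwise conv on the first half of the channels.
        self.dw_h = nn.Conv2d(half, half, (1, k), padding=(0, pad), groups=half)
        # Vertical 1-D depthwise conv on the remaining channels.
        self.dw_v = nn.Conv2d(in_ch - half, in_ch - half, (k, 1),
                              padding=(pad, 0), groups=in_ch - half)
        # Standard pointwise convolution, as in depthwise separable blocks.
        self.pw = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        half = self.dw_h.in_channels
        xh, xv = x[:, :half], x[:, half:]
        y = torch.cat([self.dw_h(xh), self.dw_v(xv)], dim=1)
        return self.pw(y)

if __name__ == "__main__":
    block = FuSeBlock(32, 64)
    print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```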