Both efficient neural networks and hardware accelerators are being explored to speed up DNN inference on edge devices. For example, MobileNet uses depthwise separable convolutions to achieve much lower latency, while systolic arrays provide much higher performance per watt. Interestingly, however, the combination of these two ideas is inefficient: the computational patterns of depthwise separable convolutions are not systolic and lack the data reuse needed to saturate the systolic array's constrained dataflow. In this paper, we propose FuSeConv (Fully-Separable Convolution) as a drop-in replacement for depthwise separable convolutions. FuSeConv generalizes the decomposition of convolutions fully to separable 1D convolutions along the spatial and depth dimensions. The resultant computation is systolic and efficiently utilizes the systolic array with a slightly modified dataflow. With FuSeConv, we achieve a significant speed-up of 3x-7x with the MobileNet family of networks on a 64x64 systolic array, with comparable accuracy on the ImageNet dataset. This high speed-up motivates the exploration of hardware-aware Neural Operator Search (NOS) to complement ongoing efforts on Neural Architecture Search (NAS).
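To make the decomposition concrete, below is a minimal PyTorch sketch of what a fully-separable block could look like: the KxK depthwise convolution of a depthwise-separable block is replaced by parallel 1D depthwise convolutions along the two spatial axes, followed by the usual 1x1 pointwise convolution. The class name FuSeConvBlock and the exact channel arrangement (concatenating row-wise and column-wise responses before the pointwise step) are illustrative assumptions, not the paper's reference implementation.

import torch
import torch.nn as nn

class FuSeConvBlock(nn.Module):
    """Illustrative sketch (not the reference implementation):
    replaces a KxK depthwise convolution with parallel 1xK and Kx1
    depthwise 1D convolutions, whose outputs are concatenated and
    mixed by a 1x1 pointwise convolution."""

    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        # Horizontal 1xK depthwise convolution (one 1D filter per channel).
        self.row = nn.Conv2d(in_ch, in_ch, (1, k), stride=stride,
                             padding=(0, k // 2), groups=in_ch)
        # Vertical Kx1 depthwise convolution (one 1D filter per channel).
        self.col = nn.Conv2d(in_ch, in_ch, (k, 1), stride=stride,
                             padding=(k // 2, 0), groups=in_ch)
        # Pointwise 1x1 convolution mixes the 2*in_ch 1D feature maps.
        self.point = nn.Conv2d(2 * in_ch, out_ch, 1)

    def forward(self, x):
        # Concatenate row- and column-wise responses along channels,
        # then fuse them across the depth dimension.
        return self.point(torch.cat([self.row(x), self.col(x)], dim=1))

# Usage: drop-in replacement for a depthwise-separable block.
y = FuSeConvBlock(32, 64)(torch.randn(1, 32, 56, 56))  # -> (1, 64, 56, 56)

Because each 1D convolution slides a length-K filter along a single axis per channel, its operand reuse maps onto rows of a systolic array far more naturally than a KxK depthwise filter does, which is the intuition behind the reported utilization gains.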