Though network sparsity emerges as a promising direction to overcome the drastically increasing size of neural networks, it remains an open problem to simultaneously maintain model accuracy and achieve significant speedups on general CPUs. In this paper, we propose a novel $1\times N$ block sparsity pattern (block pruning) to break this limitation. In particular, consecutive $N$ output kernels sharing the same input channel index are grouped into one block, which serves as the basic pruning granularity of our pattern. Our $1 \times N$ sparsity pattern prunes the blocks that are considered unimportant. We also provide a workflow of filter rearrangement that first rearranges the weight matrix in the output channel dimension to derive more influential blocks for accuracy improvements, and then applies a similar rearrangement to the next-layer weights in the input channel dimension to ensure correct convolutional operations. Moreover, the output computation after our $1 \times N$ block sparsity can be realized via a parallelized block-wise vectorized operation, leading to significant speedups on general CPU-based platforms. The efficacy of our pruning pattern is demonstrated with experiments on ILSVRC-2012. For example, at 50% sparsity and $N=4$, our pattern obtains about a 3.0% improvement over filter pruning in the top-1 accuracy of MobileNet-V2. Meanwhile, it saves 56.04ms of inference time on a Cortex-A7 CPU compared with weight pruning. Code is available at https://github.com/lmbxmu/1xN.
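To make the block grouping concrete, the sketch below shows one way to score and mask $1\times N$ blocks of a convolutional weight tensor. It assumes a PyTorch `Conv2d` weight of shape `(C_out, C_in, kH, kW)` with `C_out` divisible by `N`; the helper name `make_1xN_mask` and the L1-norm block score are illustrative assumptions, not necessarily the exact criterion used in the paper.

```python
import torch

def make_1xN_mask(weight: torch.Tensor, N: int = 4, sparsity: float = 0.5) -> torch.Tensor:
    """Build a binary mask that prunes whole 1xN blocks of a conv weight.

    weight: (C_out, C_in, kH, kW); C_out is assumed divisible by N.
    A block groups N consecutive output kernels sharing one input channel index.
    """
    C_out, C_in, kH, kW = weight.shape
    # View the weights as (C_out // N, N, C_in, kH*kW): each slice over the
    # N-dimension at a fixed input channel index is one 1xN block.
    blocks = weight.detach().abs().reshape(C_out // N, N, C_in, kH * kW)
    # Block importance: L1 norm over the N kernels of each block (an assumption).
    scores = blocks.sum(dim=(1, 3))                      # (C_out // N, C_in)
    # Keep the top (1 - sparsity) fraction of blocks, prune the rest.
    num_keep = int(round(scores.numel() * (1.0 - sparsity)))
    flat = scores.flatten()
    keep_flat = torch.zeros_like(flat)
    if num_keep > 0:
        keep_flat[flat.topk(num_keep).indices] = 1.0
    keep = keep_flat.reshape_as(scores)                  # (C_out // N, C_in)
    # Broadcast the per-block keep/prune decision back to the full weight shape.
    mask = keep[:, None, :, None].expand(C_out // N, N, C_in, kH * kW)
    return mask.reshape(C_out, C_in, kH, kW)

# Hypothetical usage: mask a conv layer's weights at 50% block sparsity.
conv_w = torch.randn(64, 32, 3, 3)
mask = make_1xN_mask(conv_w, N=4, sparsity=0.5)
pruned_w = conv_w * mask
```

Because each surviving block covers $N$ consecutive output channels for one input channel, the nonzero weights stay contiguous in memory along the output dimension, which is what allows the block-wise vectorized computation mentioned above.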