Unstructured pruning reduces the memory footprint of deep neural networks (DNNs). Recently, researchers have proposed different types of structural pruning that aim to reduce the computational complexity as well. In this work, we first suggest a new measure called mask-diversity, which correlates with the expected accuracy of the different types of structural pruning. We focus on the recently suggested N:M fine-grained block sparsity mask, in which each block of M weights contains at least N zeros. While N:M fine-grained block sparsity allows acceleration on modern hardware, it can be used only to accelerate the inference phase. To allow similar acceleration in the training phase, we suggest a novel transposable fine-grained sparsity mask, in which the same mask can be used for both the forward and backward passes. Our transposable mask guarantees that both the weight matrix and its transpose follow the same sparsity pattern; thus, the matrix multiplication required for passing the error backward can also be accelerated. We formulate the problem of finding the optimal transposable mask as a minimum-cost flow problem. Additionally, to speed up the minimum-cost flow computation, we introduce a fast linear-time approximation that can be used when the masks change dynamically during training. Our experiments show a 2x speed-up in matrix multiplications with no accuracy degradation on vision and language models. Finally, to address the problem of switching between different structural constraints, we suggest a method to convert a pre-trained model with unstructured sparsity to an N:M fine-grained block sparsity model with little to no retraining. A reference implementation can be found at https://github.com/papers-submission/structured_transposable_masks.
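To make the transposable constraint concrete, below is a minimal Python/PyTorch sketch of a greedy heuristic for building such a mask. The helper name `transposable_nm_mask` is hypothetical, and the greedy ordering is only an illustration under simplified assumptions (weight dimensions divisible by M); it is not the paper's minimum-cost flow solver or its exact linear-time approximation. The idea shown is that every M x M block of the mask must have at least N zeros in each row and in each column, so the same mask remains a valid N:M pattern for the transposed weight used in the backward pass.

```python
import torch


def transposable_nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Greedy sketch (not the paper's algorithm): build a binary mask whose
    every m x m block has at least `n` zeros in each row AND each column,
    so the mask is simultaneously a valid N:M pattern for the weight matrix
    and for its transpose. Assumes both weight dims are multiples of m.
    """
    keep = m - n  # non-zeros allowed per row/column inside each m x m block
    rows, cols = weight.shape
    mask = torch.zeros_like(weight)
    for i in range(0, rows, m):
        for j in range(0, cols, m):
            block = weight[i:i + m, j:j + m].abs()
            row_left = [keep] * m  # remaining keep-capacity per block row
            col_left = [keep] * m  # remaining keep-capacity per block column
            # Visit entries from largest to smallest magnitude and keep an
            # entry only while its row and column both have capacity left.
            order = torch.argsort(block.flatten(), descending=True)
            for idx in order.tolist():
                r, c = divmod(idx, m)
                if row_left[r] > 0 and col_left[c] > 0:
                    mask[i + r, j + c] = 1.0
                    row_left[r] -= 1
                    col_left[c] -= 1
    # Note: when capacities conflict, the greedy pass may keep fewer than
    # `keep` entries in some rows/columns, i.e., it can over-prune relative
    # to an optimal (min-cost-flow) solution, but the result always satisfies
    # the "at least n zeros per row and per column of each block" constraint.
    return mask


# Usage: the same mask serves the forward pass (weight * mask) and the
# backward pass, since mask.T obeys the identical per-block constraint.
w = torch.randn(8, 8)
mask = transposable_nm_mask(w, n=2, m=4)
```

Because every block row and block column obeys the same budget, `mask.T` satisfies the identical N:M constraint for the transposed weight; this symmetry is exactly what allows the matrix multiplication that propagates the error backward to reuse the sparse pattern.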