In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix Multiply (GEMM) by 2x, and doubles throughput by skipping the computation of zero values. So far, it has been used only to prune weights. We examine how this method can also be applied to activations and their gradients (i.e., "neural gradients"). To this end, we first establish a tensor-level optimality criterion. Previous works aimed to minimize the mean-square error (MSE) of each pruned block. We show that while MSE minimization works well for pruning activations, it fails catastrophically for the neural gradients. Instead, we show that optimal pruning of the neural gradients requires an unbiased minimum-variance pruning mask. We design such specialized masks and find that in most cases 1:2 sparsity is sufficient for training, and that 2:4 sparsity is usually enough when it is not. Further, we suggest combining several such methods to potentially speed up training even more. A reference implementation is available at https://github.com/brianchmiel/Act-and-Grad-structured-sparsity.
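To make the two pruning criteria concrete, the following PyTorch sketch contrasts a greedy MSE-style 1:2 mask (keep the larger-magnitude entry in each block of two, as one would for activations) with an unbiased 1:2 mask that samples the kept entry proportionally to its magnitude and rescales it (as suggested for neural gradients). This is an illustrative sketch under our own assumptions, not the reference implementation; the function names and the specific magnitude-proportional sampling and rescaling scheme are assumptions for demonstration.

```python
import torch


def prune_1_2_mse(x: torch.Tensor) -> torch.Tensor:
    """Greedy, MSE-minimizing 1:2 pruning: in each block of two,
    keep the larger-magnitude entry and zero the other (biased)."""
    blocks = x.reshape(-1, 2)
    keep = blocks.abs().argmax(dim=1, keepdim=True)
    mask = torch.zeros_like(blocks).scatter_(1, keep, 1.0)
    return (blocks * mask).reshape_as(x)


def prune_1_2_unbiased(x: torch.Tensor) -> torch.Tensor:
    """Unbiased 1:2 pruning sketch: in each block of two, keep entry i with
    probability |x_i| / (|x_0| + |x_1|) and rescale it by the inverse of that
    probability, so the pruned block equals the original in expectation."""
    blocks = x.reshape(-1, 2)
    mag = blocks.abs() + 1e-12              # epsilon avoids all-zero rows
    p = mag / mag.sum(dim=1, keepdim=True)  # per-block selection probabilities
    keep = torch.multinomial(p, 1)          # sample one index per block
    kept = blocks.gather(1, keep)
    scale = 1.0 / p.gather(1, keep)         # inverse-probability rescaling
    out = torch.zeros_like(blocks).scatter_(1, keep, kept * scale)
    return out.reshape_as(x)


if __name__ == "__main__":
    g = torch.randn(8, 4)
    # Averaging many stochastic masks recovers g (unbiasedness), while the
    # greedy MSE mask leaves a persistent error on the discarded entries.
    est = torch.stack([prune_1_2_unbiased(g) for _ in range(2000)]).mean(0)
    print((est - g).abs().mean().item(), (prune_1_2_mse(g) - g).abs().mean().item())
```

The unbiased mask trades a larger per-step error for zero bias, which matters for gradients because any systematic bias accumulates over many training steps, whereas the greedy minimum-MSE mask is the natural choice when only the per-tensor reconstruction error matters, as with activations.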