Sparsity has become one of the most promising methods to compress and accelerate Deep Neural Networks (DNNs). Among the different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern accelerators. In particular, N:M sparsity is attractive because there are already hardware accelerator architectures that can leverage certain forms of N:M structured sparsity to yield higher compute efficiency. In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute cost (FLOPs). Building upon this study, we propose two new decay-based pruning methods, namely "pruning mask decay" and "sparse structure decay". Our evaluations indicate that these proposed methods consistently deliver state-of-the-art (SOTA) model accuracy, comparable to unstructured sparsity, on a Transformer-based model for a translation task. The increase in the accuracy of the sparse model using the new training recipes comes at the cost of a marginal increase in the total training compute (FLOPs).
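To make the N:M structure concrete, the following minimal sketch (not from the paper; the function name nm_prune_mask and the magnitude-based selection rule are illustrative assumptions) builds a binary mask that keeps the N largest-magnitude weights in every group of M consecutive weights, e.g. the 2:4 pattern exploited by recent sparse tensor-core hardware.

```python
import numpy as np

def nm_prune_mask(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Illustrative N:M mask (assumption: simple magnitude pruning).

    In every group of M consecutive weights along the last axis, keep the
    N entries with the largest magnitude and zero out the rest. Assumes
    the total number of weights is divisible by M.
    """
    w = weights.reshape(-1, m)                    # group weights into blocks of M
    keep = np.argsort(np.abs(w), axis=1)[:, -n:]  # indices of the N largest magnitudes per block
    mask = np.zeros_like(w)
    np.put_along_axis(mask, keep, 1.0, axis=1)    # 1 where a weight is kept, 0 where pruned
    return mask.reshape(weights.shape)

# Example: a 2:4 sparse mask for a small weight matrix
w = np.random.randn(8, 16).astype(np.float32)
mask = nm_prune_mask(w, n=2, m=4)
sparse_w = w * mask                               # exactly 2 non-zeros in every group of 4
```

With n=2 and m=4, each group of four consecutive weights retains exactly two non-zeros, which is the layout that 2:4 sparse matrix units can execute efficiently at inference time.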