The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular. At the same time, there is rapidly-growing computational support for efficiently executing the unstructured-sparse models obtained via pruning. Yet, most existing pruning methods minimize just the number of remaining weights, i.e. the size of the model, rather than optimizing for inference time. We address this gap by introducing SPDY, a new compression method which automatically determines layer-wise sparsity targets achieving a desired inference speedup on a given system, while minimizing accuracy loss. SPDY is composed of two new techniques: the first is an efficient dynamic programming algorithm for solving the speedup-constrained layer-wise compression problem assuming a set of given layer-wise sensitivity scores; the second is a local search procedure for determining accurate layer-wise sensitivity scores. Experiments across popular vision and language models show that SPDY guarantees speedups while recovering higher accuracy relative to existing strategies, both for one-shot and gradual pruning scenarios, and is compatible with most existing pruning approaches. We also extend our approach to the recently-proposed task of pruning with very little data, where we achieve the best known accuracy recovery when pruning to the GPU-supported 2:4 sparsity pattern.
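The speedup-constrained layer-wise problem described above can be illustrated with a small knapsack-style dynamic program: given, for each layer, a set of candidate sparsity levels with associated (integer) runtimes and sensitivity scores, pick one level per layer so that the total runtime fits a target budget while the summed sensitivity is minimized. This is a minimal sketch under those assumptions, not the authors' implementation; the function name, data layout, and the integer-time discretization are hypothetical.

```python
from math import inf

def choose_sparsities(costs, errors, budget):
    """Pick one sparsity level per layer so total runtime <= budget,
    minimizing the sum of sensitivity scores (hypothetical sketch).

    costs[l][s]  -- integer runtime of layer l at sparsity choice s
    errors[l][s] -- sensitivity score of layer l at sparsity choice s
    Returns (best_total_error, chosen_level_per_layer).
    """
    num_layers = len(costs)
    # dp[t] = minimal total error achievable with total runtime exactly t
    dp = [inf] * (budget + 1)
    dp[0] = 0.0
    # choice[l][t] remembers which level produced dp[t] after layer l
    choice = [[-1] * (budget + 1) for _ in range(num_layers)]

    for l in range(num_layers):
        new = [inf] * (budget + 1)
        for t in range(budget + 1):
            if dp[t] == inf:
                continue
            for s, c in enumerate(costs[l]):
                nt = t + c
                if nt <= budget and dp[t] + errors[l][s] < new[nt]:
                    new[nt] = dp[t] + errors[l][s]
                    choice[l][nt] = s
        dp = new

    best_t = min(range(budget + 1), key=lambda t: dp[t])
    if dp[best_t] == inf:
        return inf, []  # budget infeasible even at maximum sparsity

    # Backtrack through the recorded choices to recover the levels.
    levels, t = [], best_t
    for l in range(num_layers - 1, -1, -1):
        s = choice[l][t]
        levels.append(s)
        t -= costs[l][s]
    levels.reverse()
    return dp[best_t], levels
```

For example, with two layers whose candidate levels cost 3/2/1 time units at increasing error, a budget of 4 forces both layers to their middle level rather than pruning either one aggressively. The quadratic-in-budget DP is what makes searching over per-layer sparsity profiles tractable compared with brute-force enumeration, which is exponential in the number of layers.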