We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
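To make the semi-structured pattern concrete, below is a minimal sketch of what 2:4 sparsity means: within every contiguous group of four weights, exactly two are zeroed, giving 50% sparsity in a hardware-friendly layout. The sketch uses simple magnitude-based selection purely for illustration; it is not SparseGPT's actual algorithm, whose mask selection and weight reconstruction are more involved. The function name `apply_2_4_sparsity` is hypothetical.

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero out 2 of every 4 consecutive weights along the last dimension,
    keeping the 2 largest-magnitude entries per group (illustrative only)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "columns must be divisible by 4 for a 2:4 pattern"
    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the 2 smallest-magnitude weights in each group of 4.
    _, drop_idx = groups.abs().topk(2, dim=-1, largest=False)
    # Build a mask that zeros those positions.
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(rows, cols)

# Example: a small random weight matrix; exactly 50% of entries become zero.
W = torch.randn(8, 16)
W_sparse = apply_2_4_sparsity(W)
print((W_sparse == 0).float().mean())  # prints 0.5
```

A 4:8 pattern works analogously (four zeros in every group of eight), trading slightly more flexibility in mask choice for the same overall 50% sparsity.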