Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly accurate and highly efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously proposed one-shot quantization methods while preserving accuracy, allowing us for the first time to execute a 175 billion-parameter model inside a single GPU for generative inference. Moreover, we show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16 of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
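For intuition, the sketch below illustrates the layer-wise, second-order quantization idea the abstract summarizes: columns of a weight matrix are rounded one at a time, and each column's rounding error is compensated over the not-yet-quantized columns using the (Cholesky factor of the) inverse Hessian of the layer reconstruction loss. This is a minimal NumPy sketch under stated assumptions, not the released implementation: the function name `gptq_quantize_layer`, the symmetric per-row quantization grid, and the dampening constant are illustrative choices, and the blocked/lazy batch updates and grouping used in the actual code are omitted.

```python
import numpy as np

def gptq_quantize_layer(W, X, bits=4, damp=0.01):
    """
    Minimal sketch of one-shot weight quantization with approximate
    second-order error compensation, in the spirit of GPTQ.
    W: (rows, cols) weight matrix of a linear layer.
    X: (cols, n_samples) calibration inputs to that layer.
    """
    rows, cols = W.shape
    W = W.astype(np.float64).copy()

    # Proxy Hessian of the layer-wise reconstruction loss, plus dampening
    # for numerical stability (illustrative constant).
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(cols)

    # Upper-triangular Cholesky factor of H^{-1}; its rows provide the
    # error-compensation coefficients for a fixed left-to-right order.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    max_q = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / max_q  # simple symmetric per-row grid (assumption)

    Q = np.zeros_like(W)
    for j in range(cols):
        w_col = W[:, j]
        q_col = np.clip(np.round(w_col / scale), -max_q - 1, max_q) * scale
        Q[:, j] = q_col
        # Spread this column's rounding error over the remaining columns
        # so that later rounding steps can partially absorb it.
        err = (w_col - q_col) / U[j, j]
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])

    return Q
```

In practice the released code applies this procedure independently to each linear layer of the Transformer, using a small calibration set to form the Hessian proxy, which is what makes quantizing a 175-billion-parameter model feasible in a few GPU hours.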