Generative Pre-trained Transformer (GPT) models set themselves apart through breakthrough performance on complex language modelling tasks, but also through their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques are limited by the scale and complexity of GPT models. In this paper, we address this challenge and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth to 3 or 4 bits per weight with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains of previously-proposed one-shot quantization methods while preserving accuracy, allowing us for the first time to execute a 175-billion-parameter model inside a single GPU. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16 of around 2x when using high-end GPUs (NVIDIA A100) and 4x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
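The abstract only summarizes the approach; the full algorithm is in the linked repository. As a rough intuition for what "one-shot weight quantization based on approximate second-order information" means, the minimal PyTorch sketch below quantizes the weights of a single linear layer column by column and uses an inverse-Hessian estimate built from calibration inputs to compensate the remaining weights for the error introduced at each step. All names here (`gptq_layer_sketch`, `quantize_rtn`, the `damp` parameter) are illustrative assumptions, not the API of the released implementation, and the sketch omits the blocking and lazy-update optimizations that make the real method fast.

```python
import torch

def quantize_rtn(w, scale, bits=4):
    # Round-to-nearest uniform quantization of a weight column onto a signed grid.
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

def gptq_layer_sketch(W, X, bits=4, damp=0.01):
    # W: (rows, cols) weights of one linear layer; X: (cols, n_samples) calibration inputs.
    rows, cols = W.shape
    # Proxy Hessian of the layer-wise reconstruction error: H = 2 X X^T.
    H = 2 * X @ X.T
    # Dampen the diagonal for numerical stability before inverting.
    H += damp * torch.diag(H).mean() * torch.eye(cols)
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))

    W = W.clone()
    Q = torch.zeros_like(W)
    # One quantization scale per output row (per-channel grid).
    scale = W.abs().amax(dim=1) / (2 ** (bits - 1) - 1)

    for j in range(cols):
        w = W[:, j]
        q = quantize_rtn(w, scale, bits)
        Q[:, j] = q
        # Spread the quantization error of column j onto the not-yet-quantized
        # columns, weighted by the corresponding inverse-Hessian entries.
        err = (w - q) / Hinv[j, j]
        W[:, j:] -= err.unsqueeze(1) * Hinv[j, j:].unsqueeze(0)
    return Q
```

A plain round-to-nearest baseline would simply apply `quantize_rtn` to every column independently; the inverse-Hessian compensation step is what the abstract refers to as second-order information, and it is what allows the bitwidth to drop to 3 or 4 bits with little accuracy loss.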