Large-scale language models (LLMs) have demonstrated outstanding performance on various tasks, but their deployment is challenging due to their enormous model size. In this paper, we identify that the main challenge in quantizing LLMs stems from the different activation ranges across channels, rather than just the presence of outliers. We propose a novel reorder-based quantization approach, RPTQ, which addresses the problem of quantizing the activations of LLMs. RPTQ rearranges the channels in the activations and then quantizes them in clusters, thereby reducing the impact of range differences between channels. In addition, we reduce storage and computation overhead by avoiding explicit reordering. With this approach, we achieve a significant breakthrough by pushing LLMs to 3-bit activation quantization for the first time.
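The core idea above — sorting channels by their activation range, grouping similar channels, and quantizing each group with its own parameters — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the clustering-by-sorting heuristic, and the per-cluster min-max quantizer are all illustrative assumptions, shown only to contrast per-cluster quantization with naive per-tensor quantization.

```python
import numpy as np

def per_tensor_quant(acts, n_bits=3):
    """Baseline: one min-max scale for the whole activation tensor."""
    qmax = 2 ** n_bits - 1
    lo, hi = acts.min(), acts.max()
    scale = (hi - lo) / qmax
    q = np.clip(np.round((acts - lo) / scale), 0, qmax)
    return q * scale + lo  # dequantized values

def reorder_cluster_quant(acts, n_clusters=4, n_bits=3):
    """Illustrative sketch (not the paper's algorithm): sort channels by
    their value range, split the sorted order into clusters, and quantize
    each cluster with its own scale and zero point."""
    ranges = acts.max(axis=0) - acts.min(axis=0)
    order = np.argsort(ranges)                 # channels with similar ranges end up adjacent
    clusters = np.array_split(order, n_clusters)

    qmax = 2 ** n_bits - 1
    deq = np.empty_like(acts)
    for idx in clusters:
        block = acts[:, idx]
        lo, hi = block.min(), block.max()
        scale = (hi - lo) / qmax if hi > lo else 1.0
        q = np.clip(np.round((block - lo) / scale), 0, qmax)
        deq[:, idx] = q * scale + lo           # dequantize each cluster separately
    return deq

# Channels with very different ranges, mimicking the problem described above.
rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 16)) * rng.uniform(0.1, 10.0, size=16)

err_global = np.abs(per_tensor_quant(acts) - acts).mean()
err_cluster = np.abs(reorder_cluster_quant(acts) - acts).mean()
```

Because channels whose ranges differ by two orders of magnitude no longer share one scale, the per-cluster error `err_cluster` comes out well below the per-tensor error `err_global` in this toy setup, which is the intuition behind quantizing reordered channel clusters.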