Large language models (LLMs) have demonstrated outstanding performance on various tasks, but their deployment poses challenges due to their enormous model size. In this paper, we identify that the main challenge in quantizing LLMs stems from the different activation ranges across channels, rather than solely from the presence of outliers. We propose a novel reorder-based quantization approach, RPTQ, that addresses the problem of quantizing the activations of LLMs. RPTQ rearranges the channels in the activations and then quantizes them in clusters, thereby reducing the impact of range differences across channels. In addition, we reduce the storage and computation overhead by avoiding explicit reordering. With this approach, we achieve a significant breakthrough by pushing LLMs to 3-bit activation quantization for the first time.
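The core idea described above, grouping channels with similar activation ranges and quantizing each group with its own parameters, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes asymmetric uniform quantization, uses a tiny k-means over per-channel (min, max) statistics to form the clusters, and the function names (`cluster_channels`, `quantize_per_cluster`) are hypothetical.

```python
import numpy as np

def cluster_channels(acts, n_clusters=4, iters=20, seed=0):
    """Group channels by their (min, max) activation range with a small k-means.

    acts: (tokens, channels) activation matrix. Returns a (channels,) label array.
    """
    feats = np.stack([acts.min(axis=0), acts.max(axis=0)], axis=1)  # (C, 2)
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), n_clusters, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((feats[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(axis=0)
    return labels

def quantize_per_cluster(acts, labels, bits=3):
    """Fake-quantize each cluster of channels with its own scale and zero point."""
    out = np.empty_like(acts)
    qmax = 2 ** bits - 1
    for k in np.unique(labels):
        cols = labels == k
        lo, hi = acts[:, cols].min(), acts[:, cols].max()
        scale = max(hi - lo, 1e-8) / qmax
        q = np.clip(np.round((acts[:, cols] - lo) / scale), 0, qmax)
        out[:, cols] = q * scale + lo  # dequantize back for error measurement
    return out
```

Because channels with similar ranges share quantization parameters, narrow-range channels are no longer forced onto the coarse step size dictated by wide-range channels, which is what makes low-bit activation quantization viable in this setting.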