Quantization of transformer language models faces significant challenges due to harmful outliers in activations. We observe that these outliers are asymmetric and concentrated in specific channels. To address this issue, we propose the Outlier Suppression+ framework. First, we introduce channel-wise shifting and scaling operations that eliminate the asymmetry and scale down the problematic channels; we show that these operations can be seamlessly migrated into subsequent modules while preserving equivalence. Second, we quantitatively derive the optimal shifting and scaling values, accounting for both the asymmetric distribution of activations and the quantization error of the weights in the next layer. Our lightweight framework incurs minimal performance degradation under static, standard post-training quantization settings. Comprehensive results across tasks and models show that our approach achieves near-floating-point performance on both small models, such as BERT, and large language models (LLMs), including OPT, BLOOM, and BLOOMZ, at 8-bit and 6-bit settings. Furthermore, we establish a new state of the art for 4-bit BERT.
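To make the migration equivalence concrete, below is a minimal PyTorch sketch. The function name, the calibration interface, and the range-based scale heuristic are illustrative assumptions rather than the paper's exact procedure (the paper chooses the scale jointly with the next layer's weight quantization error); the sketch only shows how a per-channel shift and scale on an activation can be absorbed into the following nn.Linear without changing the network output.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def migrate_shift_scale(x_calib: torch.Tensor, linear: nn.Linear):
    """Fold a per-channel shift/scale of an activation into the next Linear.

    x_calib: calibration activations, shape (num_tokens, in_features).
    Returns (shift, scale) such that feeding (x - shift) / scale to the
    updated `linear` reproduces the original output exactly (before any
    quantization noise is introduced).
    """
    c_max = x_calib.max(dim=0).values
    c_min = x_calib.min(dim=0).values

    # Shift removes the per-channel asymmetry of the activation range.
    shift = (c_max + c_min) / 2
    # Scale shrinks outlier channels; a simple range-based heuristic is
    # used here purely for illustration.
    half_range = (c_max - c_min) / 2
    scale = torch.clamp(half_range / half_range.median(), min=1.0)

    # Equivalence being exploited:
    #   x @ W.T + b == ((x - shift) / scale) @ (W * scale).T + (b + shift @ W.T)
    if linear.bias is None:
        linear.bias = nn.Parameter(torch.zeros(
            linear.out_features,
            dtype=linear.weight.dtype,
            device=linear.weight.device,
        ))
    linear.bias.add_(shift @ linear.weight.t())   # absorb shift into the bias
    linear.weight.mul_(scale)                     # absorb scale into weight columns
    return shift, scale
```

After this transformation, only the shifted and scaled activation (x - shift) / scale needs to be quantized, while the floating-point forward pass remains mathematically unchanged.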