Transformer architectures have become the fundamental building block of widespread natural language processing~(NLP) models. With the trend toward large NLP models, growing memory and computation costs hinder their efficient deployment on resource-limited devices, so transformer quantization has attracted wide research interest. Recent work recognizes structured outliers as the critical bottleneck for quantization performance; however, the proposed remedies increase computation overhead and still leave the outliers in place. To address this problem at its root, this paper delves into the inherent inducement and importance of the outliers. We discover that $\boldsymbol \gamma$ in LayerNorm (LN) acts as a sinful amplifier for the outliers, and that the importance of outliers varies greatly: some outliers, contributed by only a few tokens, cover a large area but can be clipped sharply without negative impact. Motivated by these findings, we propose an outlier suppression framework with two components: Gamma Migration and Token-Wise Clipping. Gamma Migration moves the outlier amplifier into subsequent modules through an equivalent transformation, yielding a more quantization-friendly model without any extra burden. Token-Wise Clipping exploits the large variance across token ranges with a token-wise coarse-to-fine pipeline that efficiently finds a clipping range with minimal final quantization loss. The framework effectively suppresses the outliers and can be used in a plug-and-play manner. Extensive experiments show that our framework surpasses existing works and, for the first time, pushes 6-bit post-training BERT quantization to the full-precision (FP) level. Our code is available at https://github.com/wimh966/outlier_suppression.
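A minimal PyTorch sketch of the equivalent transformation behind Gamma Migration, assuming the LN output feeds a single `nn.Linear` (the helper `migrate_gamma` and the initialization values are hypothetical; the full method must also migrate $\boldsymbol \gamma$ into every other consumer of the LN output, such as the residual shortcut branch):

```python
import torch
import torch.nn as nn

def migrate_gamma(ln: nn.LayerNorm, fc: nn.Linear) -> None:
    """Fold LN's gamma into the following linear layer, in place.

    LN(x) = gamma * x_hat + beta = gamma * (x_hat + beta / gamma), so with
    W' = W * gamma (column-wise) and LN'(x) = x_hat + beta / gamma, the
    composition fc(LN(x)) is unchanged, while LN' no longer amplifies outliers.
    """
    with torch.no_grad():
        gamma = ln.weight.clone()
        fc.weight.mul_(gamma)   # (out, in) * (in,) scales each input column
        ln.bias.div_(gamma)     # beta -> beta / gamma
        ln.weight.fill_(1.0)    # "non-scaling" LayerNorm

# Numerical sanity check of the equivalence.
torch.manual_seed(0)
ln, fc = nn.LayerNorm(8), nn.Linear(8, 4)
nn.init.uniform_(ln.weight, 0.5, 2.0)  # pretend gamma amplifies some channels
nn.init.uniform_(ln.bias, -0.5, 0.5)
x = torch.randn(3, 8)
before = fc(ln(x))
migrate_gamma(ln, fc)
after = fc(ln(x))
assert torch.allclose(before, after, atol=1e-5)
```

Because the folding is done offline and the non-scaling LN costs no more than the original, the transformation adds no inference burden, consistent with the "without any extra burden" claim above.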
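Likewise, a hedged sketch of the coarse-to-fine idea in Token-Wise Clipping: the helper `token_wise_clip`, its `alpha`/`steps` parameters, and the plain MSE criterion in the fine stage are illustrative assumptions (the paper searches for the range that minimizes the final quantization loss rather than a local reconstruction error):

```python
import torch

def token_wise_clip(act: torch.Tensor, n_bits: int = 6,
                    alpha: float = 0.99, steps: int = 20) -> float:
    """Coarse-to-fine search for a clipping threshold (illustrative sketch).

    act: (num_tokens, hidden) calibration activations.
    Coarse: only a few tokens own the extreme ranges, so a quantile of the
    per-token max-abs values gives a cheap initial clipping threshold.
    Fine: search below that threshold for the value with the smallest
    quantization error under symmetric n_bits quantization.
    """
    token_ranges = act.abs().amax(dim=-1)                # one range per token
    coarse = torch.quantile(token_ranges, alpha).item()  # coarse threshold
    qmax = 2 ** (n_bits - 1) - 1                         # symmetric int range
    best, best_err = coarse, float("inf")
    for frac in torch.linspace(0.5, 1.0, steps):
        clip = coarse * frac.item()
        scale = clip / qmax
        deq = (act.clamp(-clip, clip) / scale).round() * scale
        err = (deq - act).pow(2).mean().item()
        if err < best_err:
            best, best_err = clip, err
    return best

# Example: activations where a handful of tokens carry extreme outliers.
torch.manual_seed(0)
act = torch.randn(1024, 64)
act[:5] *= 50.0              # a few outlier tokens cover a large range
print(token_wise_clip(act))  # far below act.abs().max()
```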