Network quantization significantly reduces model inference complexity and has been widely used in real-world deployments. However, most existing quantization methods have been developed mainly on Convolutional Neural Networks (CNNs) and suffer severe degradation when applied to fully quantized vision transformers. In this work, we demonstrate that many of these difficulties arise from serious inter-channel variation in LayerNorm inputs, and present Power-of-Two Factor (PTF), a systematic method to reduce the performance degradation and inference complexity of fully quantized vision transformers. In addition, observing an extreme non-uniform distribution in attention maps, we propose Log-Int-Softmax (LIS) to sustain this distribution while simplifying inference through 4-bit quantization and the BitShift operator. Comprehensive experiments on various transformer-based architectures and benchmarks show that our Fully Quantized Vision Transformer (FQ-ViT) outperforms previous works while using even lower bit-width on attention maps. For instance, we reach 84.89% top-1 accuracy with ViT-L on ImageNet and 50.8 mAP with Cascade Mask R-CNN (Swin-S) on COCO. To our knowledge, we are the first to achieve near-lossless accuracy (within ~1% degradation) on fully quantized vision transformers. Code is available at https://github.com/linyang-zhh/FQ-ViT.
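To make the two ingredients mentioned above concrete, the following is a minimal NumPy sketch (not the authors' implementation): a per-channel power-of-two factor combined with a shared layer-wise scale for LayerNorm inputs, and log2 (4-bit) quantization of softmax outputs so that attention-value products reduce to bit-shifts. The function names, the calibration heuristic for the factors, and the toy data are assumptions for illustration only.

```python
import numpy as np

def ptf_quantize(x, bits=8):
    """Illustrative Power-of-Two-Factor-style quantizer for LayerNorm inputs:
    a single layer-wise scale shared across channels plus a per-channel
    power-of-two factor 2**alpha that absorbs inter-channel variation.
    The calibration heuristic below is an assumption, not the paper's exact rule."""
    qmax = 2 ** (bits - 1) - 1
    ch_absmax = np.abs(x).max(axis=0)          # per-channel range, shape (C,)
    s = ch_absmax.min() / qmax                 # shared layer-wise scale
    alpha = np.ceil(np.log2(ch_absmax / ch_absmax.min())).astype(int)  # power-of-two factors
    x_q = np.clip(np.round(x / (s * 2.0 ** alpha)), -qmax - 1, qmax).astype(np.int32)
    return x_q, s, alpha

def log2_quantize_softmax(attn_fp, bits=4):
    """Illustrative log2 (4-bit) quantization of a softmax output: store
    round(-log2(p)); multiplying V by p can then be done with a right bit-shift."""
    qmax = 2 ** bits - 1
    p = np.maximum(attn_fp, 2.0 ** -qmax)      # avoid log2(0)
    return np.clip(np.round(-np.log2(p)), 0, qmax).astype(np.int32)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Token-by-channel activations with strong inter-channel variation.
    x = rng.normal(size=(4, 8)) * np.array([1, 1, 2, 4, 8, 16, 1, 2])
    x_q, s, alpha = ptf_quantize(x)
    # Dequantize: the 2**alpha part can be recovered with an integer bit-shift.
    x_hat = x_q * s * 2.0 ** alpha
    print("max PTF reconstruction error:", np.abs(x - x_hat).max())

    attn = np.array([0.6, 0.25, 0.1, 0.05])    # one softmax row
    logp = log2_quantize_softmax(attn)
    print("log2-quantized attention:", logp, "-> approx", 2.0 ** (-logp))
```

In this sketch, the quantized attention entry stores only the exponent, so multiplying an integer value vector by the attention weight becomes a right shift by that exponent, which is the simplification the abstract attributes to the BitShift operator.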