Network quantization significantly reduces model inference complexity and has been widely used in real-world deployments. However, most existing quantization methods have been developed mainly on Convolutional Neural Networks (CNNs) and suffer severe degradation when applied to fully quantized vision transformers. In this work, we demonstrate that many of these difficulties arise because of serious inter-channel variation in LayerNorm inputs, and present Power-of-Two Factor (PTF), a systematic method to reduce the performance degradation and inference complexity of fully quantized vision transformers. In addition, observing an extremely non-uniform distribution in attention maps, we propose Log-Int-Softmax (LIS) to preserve that distribution and simplify inference by using 4-bit quantization and the BitShift operator. Comprehensive experiments on various transformer-based architectures and benchmarks show that our Fully Quantized Vision Transformer (FQ-ViT) outperforms previous works even while using lower bit-width on attention maps. For instance, we reach 84.89% top-1 accuracy with ViT-L on ImageNet and 50.8 mAP with Cascade Mask R-CNN (Swin-S) on COCO. To our knowledge, we are the first to achieve near-lossless quantization (accuracy degradation within ~1%) on fully quantized vision transformers. The code is available at https://github.com/megvii-research/FQ-ViT.
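Since the abstract only names the two techniques, the following NumPy sketch illustrates the underlying ideas in simplified form: per-channel power-of-two factors on top of a shared layer-wise quantization scale for LayerNorm inputs (PTF), and 4-bit log2 quantization of attention maps (LIS) so that multiplying by values becomes a bit-shift. The function names, the MSE search, and the exact scale/zero-point arithmetic are illustrative assumptions, not the paper's precise formulation.

```python
# Minimal sketch of the PTF and LIS ideas; illustrative only, not the paper's exact scheme.
import numpy as np

def ptf_quantize(x, bits=8, max_alpha=3):
    """Power-of-Two Factor (PTF) style quantization of a LayerNorm input.

    All channels share one layer-wise scale `s` and zero point; each channel c
    additionally picks an exponent alpha_c, so its effective scale is s / 2^alpha_c.
    Because the per-channel factor is a power of two, it can be applied to the
    integer values with a bit-shift instead of a per-channel multiplication.
    `x` has shape (tokens, channels).
    """
    qmax = 2 ** bits - 1
    s = (x.max() - x.min()) / qmax           # layer-wise scale
    zp = np.round(-x.min() / s)              # layer-wise zero point

    alphas = np.zeros(x.shape[1], dtype=np.int64)
    for c in range(x.shape[1]):
        best_err = np.inf
        for a in range(max_alpha + 1):
            scale_c = s / (2 ** a)           # finer grid for low-variance channels
            q = np.clip(np.round(x[:, c] / scale_c) + zp, 0, qmax)
            err = np.mean(((q - zp) * scale_c - x[:, c]) ** 2)
            if err < best_err:
                best_err, alphas[c] = err, a
    scale = s / (2.0 ** alphas)
    x_q = np.clip(np.round(x / scale) + zp, 0, qmax)
    return x_q, s, zp, alphas


def log_int_softmax(scores, bits=4):
    """Log-Int-Softmax (LIS) style 4-bit quantization of attention maps.

    Softmax outputs lie in [0, 1] and are heavily concentrated near zero, so a
    log2 grid fits them better than a uniform one; the stored code is the
    negated exponent, and multiplying V by 2^{-q} reduces to a right bit-shift.
    """
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    qmax = 2 ** bits - 1
    q = np.clip(np.round(-np.log2(np.maximum(attn, 2.0 ** -qmax))), 0, qmax)
    return q.astype(np.int64)                # attn ~ 2^{-q}, so attn @ V ~ (V >> q)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Channels with strongly varying ranges, mimicking inter-channel variation in LayerNorm inputs.
    x = rng.normal(size=(197, 768)) * rng.uniform(0.5, 40.0, size=768)
    x_q, s, zp, alphas = ptf_quantize(x)
    deq = (x_q - zp) * (s / 2.0 ** alphas)
    print("PTF reconstruction MSE:", np.mean((deq - x) ** 2))

    scores = rng.normal(size=(8, 197, 197))
    codes = log_int_softmax(scores)
    print("LIS code range:", codes.min(), codes.max())
```

In this sketch the power-of-two exponents are picked per channel by a brute-force MSE search; the key property being demonstrated is that the per-channel correction stays a shift on the shared integer grid, and that 4-bit log2 codes suffice to represent the long-tailed softmax outputs.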