In this paper, we propose a fully differentiable quantization method for vision transformers (ViT), named Q-ViT, in which both the quantization scales and the bit-widths are learnable parameters. Specifically, based on our observation that heads in ViT display different quantization robustness, we leverage head-wise bit-widths to reduce the size of Q-ViT while preserving performance. In addition, we propose a novel technique named switchable scale to resolve the convergence problem in the joint training of quantization scales and bit-widths. In this way, Q-ViT pushes the limits of ViT quantization to 3 bits without a heavy performance drop. Moreover, we analyze the quantization robustness of every architectural component of ViT and show that Multi-head Self-Attention (MSA) and the Gaussian Error Linear Unit (GELU) are the key components for ViT quantization. This study provides insights for further research on ViT quantization. Extensive experiments on different ViT models, such as DeiT and Swin Transformer, show the effectiveness of our quantization method. In particular, our method outperforms the state-of-the-art uniform quantization method by 1.5% on DeiT-Tiny.
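To make the idea of head-wise scales and bit-widths concrete, the following is a minimal forward-pass sketch of symmetric uniform quantization applied separately to each attention head. It is not the Q-ViT implementation: the function names `fake_quantize` and `quantize_per_head`, the signed integer range, the tensor layout `(num_heads, tokens, head_dim)`, and the rounding of a real-valued bit-width to the nearest integer are all assumptions made for illustration. The switchable-scale technique and the gradient handling needed to actually learn the scales and bit-widths are omitted.

```python
import numpy as np


def fake_quantize(x, scale, bits):
    """Quantize-dequantize x on a signed uniform grid (illustrative helper).

    Values are divided by `scale`, rounded to the nearest integer, clipped to
    the range representable with `bits` bits, and multiplied back by `scale`.
    """
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale


def quantize_per_head(x, scales, bit_widths):
    """Apply a separate scale and bit-width to each head (assumed layout).

    x:          array of shape (num_heads, tokens, head_dim)
    scales:     one positive scale per head
    bit_widths: one real-valued bit-width per head, rounded to an integer here
    """
    out = np.empty_like(x)
    for h in range(x.shape[0]):
        bits = int(round(float(bit_widths[h])))
        out[h] = fake_quantize(x[h], float(scales[h]), bits)
    return out


# Example: three heads, each with its own (hypothetical) scale and bit-width.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 8))
scales = np.array([0.5, 0.25, 0.1])
bit_widths = np.array([2.0, 3.0, 4.0])
x_q = quantize_per_head(x, scales, bit_widths)
```

In this sketch a head with fewer bits can only take a few distinct values, which is why assigning lower bit-widths to the more robust heads reduces model size. In a trainable version, `scales` and `bit_widths` would be parameters updated by gradient descent, and the rounding steps would need a gradient approximation.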