Vision transformers have recently achieved great success on various computer vision tasks; nevertheless, their high model complexity makes them challenging to deploy on resource-constrained devices. Quantization is an effective approach to reducing model complexity, and data-free quantization, which can address data privacy and security concerns during model deployment, has received widespread interest. Unfortunately, all existing methods, such as BN regularization, were designed for convolutional neural networks and cannot be applied to vision transformers, whose model architectures differ significantly. In this paper, we propose PSAQ-ViT, a Patch Similarity Aware data-free Quantization framework for Vision Transformers, which generates "realistic" samples based on the vision transformer's unique properties to calibrate the quantization parameters. Specifically, we analyze the self-attention module's properties and reveal a general difference (patch similarity) in its processing of Gaussian noise and real images. This insight guides us to design a relative value metric that optimizes the Gaussian noise to approximate real images, which are then used to calibrate the quantization parameters. Extensive experiments and ablation studies are conducted on various benchmarks to validate the effectiveness of PSAQ-ViT, which can even outperform real-data-driven methods.
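To make the abstract's "relative value metric" concrete, below is a minimal sketch (not the authors' implementation) of one plausible way to score patch similarity: given the patch-wise outputs of a self-attention block, compute the pairwise cosine similarity between patches and estimate the entropy of the resulting similarity distribution, which can then serve as an objective when optimizing a synthetic input. The function name, the kernel-density estimate, and all hyperparameters (bandwidth, grid size) are illustrative assumptions; PyTorch is assumed.

```python
import torch
import torch.nn.functional as F

def patch_similarity_entropy(patch_features, bandwidth=0.1, num_bins=64):
    """Sketch: entropy of the patch-similarity distribution for one image.

    patch_features: (N, d) tensor of patch embeddings taken from the output
    of a self-attention block. Hyperparameters are illustrative only.
    """
    # Pairwise cosine similarity between patch embeddings: (N, N)
    normed = F.normalize(patch_features, dim=-1)
    sim = normed @ normed.t()

    # Drop the diagonal (self-similarity is trivially 1)
    n = sim.size(0)
    mask = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    values = sim[mask]

    # Gaussian kernel density estimate of the similarity distribution on a grid
    grid = torch.linspace(-1.0, 1.0, num_bins, device=sim.device)
    diffs = (grid.unsqueeze(1) - values.unsqueeze(0)) / bandwidth
    density = torch.exp(-0.5 * diffs ** 2).mean(dim=1)
    density = density / (density.sum() + 1e-8)  # normalize to a discrete distribution

    # Entropy of the similarity distribution: real images tend to show a more
    # diverse patch-similarity pattern than Gaussian noise, so maximizing this
    # quantity is one way to push optimized noise toward "realistic" samples.
    entropy = -(density * torch.log(density + 1e-8)).sum()
    return entropy
```

In a calibration loop under these assumptions, one would run the model on a learnable input, collect `patch_features` from each self-attention block, sum the resulting entropies into a loss to be maximized with respect to the input, and finally use the optimized samples to calibrate the quantization parameters.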