Quantization is widely adopted as a model compression technique that obtains efficient models by converting the floating-point weights and activations of a neural network into lower-bit integers. Quantization has been proven to work well on convolutional neural networks and transformer-based models. Despite the success of these models, recent works have shown that MLP-based models can achieve comparable results on tasks ranging from computer vision and NLP to 3D point clouds, while delivering higher throughput thanks to their parallelism and architectural simplicity. However, as we show in this paper, directly applying quantization to MLP-based models leads to significant accuracy degradation. Based on our analysis, two major issues account for the accuracy gap: 1) the range of activations in MLP-based models can be too large to quantize, and 2) specific components of MLP-based models are sensitive to quantization. Consequently, we propose to 1) apply LayerNorm to control the quantization range of activations, 2) utilize bounded activation functions, 3) apply percentile quantization to activations, 4) use our improved module, named multiple token-mixing MLPs, and 5) apply a linear asymmetric quantizer to sensitive operations. Equipped with the above techniques, our Q-MLP models achieve 79.68% accuracy on ImageNet with 8-bit uniform quantization (model size 30 MB) and 78.47% with 4-bit quantization (15 MB).
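As a rough illustration of items 3) and 5), the following is a minimal NumPy sketch of percentile-clipped linear asymmetric quantization; the function name, the 99.9th-percentile choice, and the overall structure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def asymmetric_quantize(x, num_bits=8, percentile=99.9):
    """Percentile-clipped linear asymmetric quantization (illustrative sketch).

    Clipping the activation range at a percentile bounds rare outliers so
    they do not inflate the quantization step size; the remaining range is
    then mapped onto [0, 2^b - 1] with a scale and zero-point.
    """
    # Use percentiles of the observed activations instead of raw min/max.
    lo = np.percentile(x, 100.0 - percentile)
    hi = np.percentile(x, percentile)

    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (hi - lo) / (qmax - qmin)
    zero_point = np.round(qmin - lo / scale)

    # Quantize: scale, shift by the zero-point, round, and clamp to integers.
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int32)
    # Dequantize back to floats (simulated "fake" quantization).
    x_hat = (q - zero_point) * scale
    return q, x_hat, scale, zero_point

# Example: quantize a batch of heavy-tailed activations to 8 bits.
acts = np.random.randn(1024) * 3.0
q, x_hat, scale, zp = asymmetric_quantize(acts, num_bits=8)
```

The asymmetric (zero-point) form matters here because post-activation distributions are typically not centered at zero, so a symmetric quantizer would waste part of the integer range.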