We introduce a Power-of-Two post-training quantization (PTQ) method for deep neural networks that meets hardware requirements and does not require lengthy retraining. PTQ needs only a small set of calibration data and is easier to deploy, but it typically yields lower accuracy than Quantization-Aware Training (QAT). Power-of-Two quantization can convert the multiplications introduced by quantization and dequantization into bit-shifts, which are supported by many efficient accelerators. However, the Power-of-Two scale has fewer candidate values, which leads to larger rounding or clipping errors. We propose a novel Power-of-Two PTQ framework, dubbed RAPQ, which dynamically adjusts the Power-of-Two scales of the whole network instead of statically determining them layer by layer. It can theoretically trade off the rounding error and clipping error of the whole network. Meanwhile, the reconstruction method in RAPQ is based on the BN (batch normalization) information of every unit. Extensive experiments on ImageNet demonstrate the excellent performance of our proposed method. Without bells and whistles, RAPQ reaches 65% and 48% accuracy on ResNet-18 and MobileNetV2, respectively, with INT2 weights and INT4 activations. We are the first to propose PTQ for the more constrained but hardware-friendly Power-of-Two quantization, and we show that it can achieve nearly the same accuracy as SOTA PTQ methods. The code will be released.
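To illustrate why a Power-of-Two scale turns dequantization into a bit-shift, the following is a minimal sketch (not the RAPQ implementation). The helper names `po2_quantize`/`po2_dequantize` and the fixed exponent `k` are hypothetical; RAPQ instead adjusts the Power-of-Two scales jointly across the network to balance rounding and clipping error.

```python
import numpy as np

def po2_quantize(x, n_bits=4, k=6):
    """Uniform quantization with a power-of-two scale s = 2**(-k).

    k is an illustrative, fixed exponent here; choosing it too large
    increases clipping error, too small increases rounding error.
    """
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    s = 2.0 ** (-k)
    # Rounding to the grid and clipping to the integer range are the two
    # error sources the abstract refers to.
    q = np.clip(np.round(x / s), qmin, qmax).astype(np.int32)
    return q, k

def po2_dequantize(q, k):
    # With a power-of-two scale, multiplying by s = 2**(-k) is an
    # arithmetic right shift by k on integer hardware; emulated in
    # floating point here for clarity.
    return q.astype(np.float32) * 2.0 ** (-k)

if __name__ == "__main__":
    x = np.random.randn(8).astype(np.float32) * 0.1
    q, k = po2_quantize(x, n_bits=4, k=6)
    x_hat = po2_dequantize(q, k)
    print("max abs reconstruction error:", np.abs(x - x_hat).max())
```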