The post-training quantization (PTQ) challenge of bringing the accuracy of a quantized neural network close to that of the original has drawn much attention, driven by industry demand. Many existing methods emphasize optimization of a specific degree of freedom (DoF), such as the quantization step size, preconditioning factors, or bias correction, often chained with others in multi-step solutions. Here we rethink quantized network parameterization in a hardware-aware fashion, towards a unified analysis of all quantization DoF, permitting for the first time their joint end-to-end finetuning. Our simple and extendable single-step method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on par with the state of the art (SoTA) while remaining within the PTQ constraints on speed and resources.
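To make the idea of jointly finetuning several quantization degrees of freedom concrete, the sketch below is a minimal PyTorch illustration, not the paper's implementation: it assumes a hypothetical `QuantLinear` layer in which the weights, a per-channel quantization step size, and a bias-correction term are all trainable parameters, with a straight-through estimator handling the non-differentiable rounding so that a single end-to-end optimization loop updates every DoF at once.

```python
import torch
import torch.nn as nn


class QuantLinear(nn.Module):
    """Illustrative layer whose weights, step sizes, and bias correction
    are finetuned jointly (a sketch, not the QFT reference code)."""

    def __init__(self, in_features: int, out_features: int, n_bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # One learnable step size per output channel (a common DoF choice).
        init_step = self.weight.detach().abs().mean(dim=1, keepdim=True) / (2 ** (n_bits - 1))
        self.step = nn.Parameter(init_step)
        # Learnable additive bias-correction term.
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.qmin, self.qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Quantize weights; round() has zero gradient almost everywhere, so
        # pass gradients through via the detach trick (straight-through estimator).
        w_scaled = self.weight / self.step
        w_int = torch.clamp(w_scaled.round(), self.qmin, self.qmax)
        w_int = w_scaled + (w_int - w_scaled).detach()
        w_q = w_int * self.step
        return nn.functional.linear(x, w_q, self.bias)


# All three degrees of freedom (weights, step sizes, bias correction)
# receive gradients and are optimized together in one training loop.
layer = QuantLinear(64, 32)
opt = torch.optim.Adam(layer.parameters(), lr=1e-4)
out = layer(torch.randn(8, 64))
out.pow(2).mean().backward()
opt.step()
```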