Quantization is one of the key techniques used to make Neural Networks (NNs) faster and more energy efficient. However, current low precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values. This hidden cost limits the latency improvement realized by quantizing NNs. To address this, we present HAWQV3, a novel dyadic quantization framework. The contributions of HAWQV3 are the following. (i) The entire inference process consists of only integer multiplication, addition, and bit shifting in INT4/8 mixed precision, without any floating point operations/casting or even integer division. (ii) We pose the mixed-precision quantization as an integer linear programming problem, where the bit precision setting is computed to minimize model perturbation, while observing application specific constraints on memory footprint, latency, and BOPS. (iii) To verify our approach, we develop the first open source 4-bit mixed-precision quantization in TVM, and we directly deploy the quantized models to T4 GPUs using only the Turing Tensor Cores. We observe an average speed up of $1.45\times$ for uniform 4-bit, as compared to uniform 8-bit, precision for ResNet50. (iv) We extensively test the proposed dyadic quantization approach on multiple different NNs, including ResNet18/50 and InceptionV3, for various model compression levels with/without mixed precision. For instance, we achieve an accuracy of $78.50\%$ with dyadic INT8 quantization, which is more than $4\%$ higher than prior integer-only work for InceptionV3. Furthermore, we show that mixed-precision INT4/8 quantization can be used to achieve higher speed ups, as compared to INT8 inference, with minimal impact on accuracy. For example, for ResNet50 we can reduce INT8 latency by $23\%$ with mixed precision and still achieve $76.73\%$ accuracy.
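The dyadic quantization mentioned above can be sketched briefly: a real-valued requantization scale is approximated by a dyadic number $b / 2^c$ (with $b$, $c$ integers), so that applying the scale reduces to one integer multiply and one right bit-shift, with no floating point at inference time. The snippet below is a minimal illustrative sketch, not the paper's implementation; the function names, the fixed shift width of 16, and the rounding convention are assumptions made for the example.

```python
def dyadic_approx(scale: float, c: int = 16):
    """Approximate a real requantization scale by a dyadic number
    b / 2**c, so scaling becomes one integer multiply plus one
    right bit-shift. The shift width c = 16 is an illustrative
    choice, not a value from the paper."""
    b = round(scale * (1 << c))
    return b, c

def requantize(acc: int, b: int, c: int) -> int:
    """Integer-only requantization of an INT32 accumulator:
    (acc * b) >> c, with round-half-up via the added offset.
    Uses only integer multiplication, addition, and shifting."""
    return (acc * b + (1 << (c - 1))) >> c

# Example: approximate the scale 1/255 and requantize an accumulator.
b, c = dyadic_approx(1 / 255)   # b = 257, c = 16
y = requantize(1000, b, c)      # 1000 / 255 ≈ 3.92, rounds to 4
```

Because both steps stay in integer arithmetic, the conversion cost back and forth between floating point and quantized values, which the abstract identifies as the hidden cost of standard quantization pipelines, is avoided entirely.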