Current low-precision quantization algorithms often have the hidden cost of conversion back and forth between floating point and quantized integer values. This hidden cost limits the latency improvement realized by quantizing neural networks. To address this, we present HAWQV3, a novel mixed-precision integer-only quantization framework. The contributions of HAWQV3 are the following: (i) An integer-only inference where the entire computational graph is performed only with integer multiplication, addition, and bit shifting, without any floating point operations or even integer division; (ii) A novel hardware-aware mixed-precision quantization method where the bit-precision is calculated by solving an integer linear programming problem that balances the trade-off between model perturbation and other constraints, e.g., memory footprint and latency; (iii) Direct hardware deployment and open source contribution for 4-bit uniform/mixed-precision quantization in TVM, achieving an average speedup of $1.45\times$ for uniform 4-bit, as compared to uniform 8-bit, for ResNet50 on T4 GPUs; and (iv) Extensive evaluation of the proposed methods on ResNet18/50 and InceptionV3, for various model compression levels with/without mixed precision. For ResNet50, our INT8 quantization achieves an accuracy of $77.58\%$, which is $2.68\%$ higher than prior integer-only work, and our mixed-precision INT4/8 quantization can reduce INT8 latency by $23\%$ and still achieve $76.73\%$ accuracy. Our framework and the TVM implementation have been open sourced.
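To make the integer-only constraint in contribution (i) concrete, the sketch below shows one common way such a requantization step can be carried out without floating point: the real-valued rescaling factor is approximated by a dyadic number $b/2^c$ (integer $b$, non-negative integer $c$), so the INT32 accumulator is rescaled with a single integer multiplication and a bit shift. This is a minimal illustrative example, not the authors' implementation; the helper names, bit-width choices, and scale values are assumptions made for the example.

```python
import numpy as np

def dyadic_approximation(scale, max_bits=31):
    """Approximate a real-valued rescaling factor as b / 2**c
    (integer b, non-negative integer c). Hypothetical helper; the
    bit-width bound here is an illustrative choice."""
    c = max_bits
    b = int(round(scale * (1 << c)))
    # shrink the shift until the multiplier fits in the target bit width
    while b >= (1 << max_bits) and c > 0:
        c -= 1
        b = int(round(scale * (1 << c)))
    return b, c

def integer_only_requantize(acc_int32, s_in, s_w, s_out):
    """Rescale an INT32 accumulator onto the output quantization grid
    using only an integer multiply and an arithmetic right shift
    (no floating point, no integer division)."""
    b, c = dyadic_approximation(s_in * s_w / s_out)
    return (acc_int32 * b) >> c

# Toy usage with made-up scales and a fake INT32 accumulator.
acc = np.array([12345, -6789, 40210], dtype=np.int64)
print(integer_only_requantize(acc, s_in=0.02, s_w=0.005, s_out=0.04))
```

The design point this illustrates is that once the combined scale is folded into a dyadic constant offline, inference-time requantization needs no floating point at all, which is what enables the pure-integer computational graph described above.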