To accelerate the inference of deep neural networks (DNNs), quantization with low-bitwidth numbers is actively researched. A prominent challenge is to quantize DNN models to low-bitwidth numbers without significant accuracy degradation, especially at very low bitwidths (< 8 bits). This work proposes DyBit, an adaptive data representation with variable-length encoding. DyBit dynamically adjusts the precision and range of its separate bit-fields to adapt to the distribution of DNN weights/activations. We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy against speedup. Experimental results demonstrate that inference accuracy with DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization, and the proposed framework achieves up to 8.1x speedup compared with the original model.
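As an illustration only, the Python sketch below shows the general idea of a bit-field numeric format whose range/precision split is chosen from the data distribution; the exact DyBit encoding is not specified in this abstract, and the field layout, the error criterion, and all function names here are assumptions for exposition.

```python
import numpy as np

def quantize_bitfield(x, total_bits=4, exp_bits=1):
    """Quantize values with a sign / exponent / mantissa bit-field split.

    Illustrative sketch only: not the actual DyBit encoding. `exp_bits`
    controls the range-vs-precision trade-off within the fixed bitwidth.
    """
    man_bits = total_bits - 1 - exp_bits           # 1 bit reserved for sign
    max_exp = 2 ** exp_bits - 1                    # largest representable exponent value
    scale = np.max(np.abs(x)) + 1e-12              # normalize magnitudes to [0, 1]
    xn = x / scale
    # Exponent field: power-of-two magnitude bucket for each value.
    exp = np.clip(np.floor(-np.log2(np.abs(xn) + 1e-12)), 0, max_exp)
    step = 2.0 ** (-exp) / (2 ** man_bits)         # mantissa step inside each bucket
    q = np.sign(xn) * np.round(np.abs(xn) / step) * step
    return q * scale

def pick_exp_bits(x, total_bits=4):
    """Pick the exponent width minimizing quantization error on `x`
    (a stand-in for adapting bit-fields to the weight/activation distribution)."""
    errors = {e: np.mean((x - quantize_bitfield(x, total_bits, e)) ** 2)
              for e in range(1, total_bits - 1)}
    return min(errors, key=errors.get)

if __name__ == "__main__":
    # Long-tailed, bell-shaped values, roughly like trained DNN weights.
    w = np.random.laplace(scale=0.05, size=10_000)
    print("chosen exponent bits:", pick_exp_bits(w, total_bits=4))
```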