Recently, low-precision deep learning accelerators (DLAs) have become popular due to their advantages in chip area and energy consumption, yet low-precision quantized models deployed on these DLAs suffer severe accuracy degradation. One way to achieve both high accuracy and efficient inference is to deploy high-precision neural networks on low-precision DLAs, which has rarely been studied. In this paper, we propose the PArallel Low-precision Quantization (PalQuant) method, which approximates high-precision computations by learning parallel low-precision representations from scratch. In addition, we present a novel cyclic shuffle module to boost cross-group information communication between parallel low-precision groups. Extensive experiments demonstrate that PalQuant outperforms state-of-the-art quantization methods in both accuracy and inference speed, e.g., for ResNet-18 quantization, PalQuant obtains 0.52\% higher accuracy and a 1.78$\times$ speedup over its 4-bit counterpart on a state-of-the-art 2-bit accelerator. Code is available at \url{https://github.com/huqinghao/PalQuant}.
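To make the cross-group communication idea concrete, the following is a minimal PyTorch-style sketch of a cyclic channel shuffle across $G$ parallel low-precision groups; the function name, the half-and-half split ratio, and the tensor layout are illustrative assumptions, not the paper's actual implementation.

\begin{verbatim}
# Minimal sketch (not the authors' code): a cyclic shuffle that exchanges
# channels between G parallel low-precision groups. Each group keeps half of
# its channels and passes the other half to the next group (cyclically), so
# information circulates across groups through successive layers.
import torch

def cyclic_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """x: feature map of shape (N, C, H, W), with C divisible by `groups`."""
    n, c, h, w = x.shape
    cg = c // groups                          # channels per group
    x = x.view(n, groups, cg, h, w)           # split channels into groups
    keep, send = x[:, :, : cg // 2], x[:, :, cg // 2 :]
    send = torch.roll(send, shifts=1, dims=1) # pass half to the next group
    x = torch.cat([keep, send], dim=2)
    return x.view(n, c, h, w)

# usage: shuffle features split across 4 parallel low-precision groups
feat = torch.randn(8, 64, 14, 14)
out = cyclic_shuffle(feat, groups=4)
assert out.shape == feat.shape
\end{verbatim}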