Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware and do not require extensive hardware resources or a training set. Mapping FP32 models to INT8 using uniform PTQ yields models with negligible accuracy degradation; however, reducing precision below 8 bits with PTQ is challenging, as accuracy degradation becomes noticeable due to the increase in quantization noise. In this paper, we propose a sparsity-aware quantization (SPARQ) method, in which unstructured and dynamic activation sparsity is leveraged at different representation granularities. For example, 4-bit quantization is performed by dynamically examining the bits of each 8-bit value and choosing a 4-bit window, first skipping leading zero-value bits. Moreover, instead of quantizing activation-by-activation to 4 bits, we consider pairs of 8-bit activations and check whether one of the two equals zero. If one is zero, the other can opportunistically use its partner's 4-bit budget; if neither is zero, each is dynamically quantized to 4 bits, as described. SPARQ achieves minor accuracy degradation and a practical hardware implementation. The code is available at https://github.com/gilshm/sparq.
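To make the bit-window and pair-sharing ideas concrete, below is a minimal illustrative sketch in Python. It is not the authors' implementation (see the repository above for that); the function names are hypothetical, it operates on single non-negative integers rather than tensors, and it simply truncates the bits below the chosen window instead of reproducing the paper's exact rounding and packing scheme.

```python
def quantize_window(x, window=4, total_bits=8):
    """Sketch: pick a 4-bit window inside an 8-bit magnitude by skipping
    leading zero bits, then drop the bits below the window.
    Illustrative only; not the paper's exact rounding/packing scheme."""
    x = int(x)
    assert 0 <= x < (1 << total_bits)
    if x == 0:
        return 0
    msb = x.bit_length() - 1            # position of the most significant set bit
    shift = max(msb - (window - 1), 0)  # lowest bit position of the 4-bit window
    return (x >> shift) << shift        # keep the window, zero out lower bits


def quantize_pair(a, b, window=4, total_bits=8):
    """Sketch of pair-wise budget sharing: if one activation of the pair is
    zero, the other keeps its full 8-bit value; otherwise each is windowed
    to 4 bits as above."""
    if a == 0:
        return 0, b                     # b opportunistically uses the full budget
    if b == 0:
        return a, 0
    return (quantize_window(a, window, total_bits),
            quantize_window(b, window, total_bits))


# Example: 0b01011011 (91) -> window over bits 6..3 -> 0b01011000 (88)
print(quantize_window(0b01011011))              # 88
print(quantize_pair(0, 0b01011011))             # (0, 91): zero partner frees the full 8 bits
print(quantize_pair(0b00000111, 0b01011011))    # both nonzero: each windowed to 4 bits
```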