Inference time, model size, and accuracy are critical for deploying deep neural network models. Numerous research efforts have aimed to compress neural network models while achieving faster inference and higher accuracy; pruning and quantization are the mainstream methods to this end. During model quantization, converting individual float values of layer weights to low-precision ones can substantially reduce the computational overhead and improve the inference speed. Many quantization methods have been studied, for example, vector quantization, low-bit quantization, and binary/ternary quantization. This survey focuses on ternary quantization. We review the evolution of ternary quantization and investigate the relationships among existing ternary quantization methods from the perspective of projection functions and optimization methods.
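To make the idea concrete, the following is a minimal sketch of a threshold-based ternary projection: each float weight is mapped to one of {-α, 0, +α}. The 0.7·mean(|w|) threshold heuristic and the choice of α as the mean magnitude of the retained weights are one common variant; the exact projection rule differs across the methods this survey reviews.

```python
import numpy as np

def ternary_project(w, delta_factor=0.7):
    """Project float weights onto {-alpha, 0, +alpha}.

    Uses a threshold-based rule: weights with magnitude below
    delta are zeroed, the rest are mapped to +/-alpha. This is
    one common heuristic, not the definitive method.
    """
    delta = delta_factor * np.mean(np.abs(w))   # quantization threshold
    mask = np.abs(w) > delta                    # weights kept non-zero
    # alpha: mean magnitude of the retained weights
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.array([0.9, -0.05, 0.4, -0.8, 0.02])
print(ternary_project(w))  # small-magnitude weights become 0
```

Storing only the sign pattern plus one scaling factor per layer is what yields the memory and compute savings the abstract refers to.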