State-of-the-art language models (LMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications. Low-bit neural network quantization provides a powerful solution to dramatically reduce their model size. Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity of different parts of LMs to quantization errors. To this end, novel mixed precision neural network LM quantization methods are proposed in this paper. The optimal local precision choices for LSTM-RNN and Transformer based neural LMs are automatically learned using three techniques. The first two approaches are based on quantization sensitivity metrics in the form of either the KL-divergence measured between full precision and quantized LMs, or Hessian trace weighted quantization perturbation that can be approximated efficiently using matrix-free techniques. The third approach is based on mixed precision neural architecture search. In order to overcome the difficulty in using gradient descent methods to directly estimate discrete quantized weights, the alternating direction method of multipliers (ADMM) is used to efficiently train quantized LMs. Experiments were conducted on state-of-the-art LF-MMI CNN-TDNN systems featuring speed perturbation, i-Vector and learning hidden unit contribution (LHUC) based speaker adaptation on two tasks: Switchboard telephone speech and AMI meeting transcription. The proposed mixed precision quantization techniques achieved "lossless" quantization on both tasks, producing model size compression ratios of up to approximately 16 times over the full precision LSTM and Transformer baseline LMs while incurring no statistically significant word error rate increase.
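To make the Hessian-based sensitivity metric concrete, the sketch below shows how a per-layer Hessian trace can be estimated matrix-free with Hutchinson's method, using only Hessian-vector products obtained by double backpropagation; weighting the estimated trace by the squared quantization perturbation of a layer's weights then ranks layers by sensitivity. This is a minimal PyTorch sketch under assumed interfaces, not the paper's implementation: the function name and the probe count are illustrative.

```python
import torch

def hutchinson_hessian_trace(loss, params, n_probes=50):
    """Matrix-free Hutchinson estimate of Tr(H) for one parameter group:
    Tr(H) ~= E[v^T H v] with Rademacher probe vectors v. Hessian-vector
    products come from double backprop, so H is never formed explicitly."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimates = []
    for _ in range(n_probes):
        # Rademacher probes: entries drawn uniformly from {-1, +1}.
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector products: Hv = d(g . v)/dW via double backprop.
        hvs = torch.autograd.grad(grads, params, grad_outputs=vs,
                                  retain_graph=True)
        estimates.append(sum((v * hv).sum() for v, hv in zip(vs, hvs)))
    return torch.stack(estimates).mean()

# Hessian trace weighted quantization perturbation for one layer l
# (illustrative form): sensitivity_l = trace_l * ||Q(W_l) - W_l||^2,
# with higher-sensitivity layers assigned higher local bit-widths.
```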
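A minimal sketch of ADMM-based quantized training follows, assuming the standard decomposition into full-precision weights W, quantized copies Q, and scaled dual variables lam: each round takes a gradient step on the augmented Lagrangian, projects W + lam onto the quantized grid, and updates the duals. The helper names (`quantize`, `admm_step`), the symmetric uniform codebook, and the hyperparameters are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def quantize(x, n_bits=4):
    """Nearest-neighbour projection onto a symmetric uniform n-bit grid
    (placeholder codebook; under mixed precision, n_bits varies per layer)."""
    levels = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / levels
    return torch.round(x / scale).clamp(-levels, levels) * scale

def admm_step(model, loss_fn, batch, Q, lam, opt, rho=1e-3, n_bits=4):
    """One ADMM round: (1) primal SGD step on the augmented Lagrangian
    over the full-precision weights W, (2) projection update of the
    quantized copies Q, (3) dual ascent on the multipliers lam."""
    opt.zero_grad()
    loss = loss_fn(model, batch)
    # Augmented-Lagrangian penalty (rho/2) * ||W - Q + lam||^2 per tensor.
    for p, q, l in zip(model.parameters(), Q, lam):
        loss = loss + 0.5 * rho * (p - q + l).pow(2).sum()
    loss.backward()
    opt.step()
    with torch.no_grad():
        for p, q, l in zip(model.parameters(), Q, lam):
            q.copy_(quantize(p + l, n_bits))  # projection onto quantized set
            l.add_(p - q)                     # dual update: lam += W - Q
    return loss.item()

# Initialisation (illustrative): Q as quantized snapshots, duals at zero.
# Q   = [quantize(p.detach().clone()) for p in model.parameters()]
# lam = [torch.zeros_like(p) for p in model.parameters()]
```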