State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications. Low-bit deep neural network quantization techniques provide a powerful solution to dramatically reduce their model size. Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity of different parts of the system to quantization errors. To this end, novel mixed precision DNN quantization methods are proposed in this paper. The optimal local precision settings are automatically learned using two techniques. The first is based on a quantization sensitivity metric in the form of the Hessian trace weighted quantization perturbation. The second is based on mixed precision Transformer architecture search. The alternating direction method of multipliers (ADMM) is used to efficiently train mixed precision quantized DNN systems. Experiments conducted on Penn Treebank (PTB) and a Switchboard corpus trained LF-MMI TDNN system suggest that the proposed mixed precision Transformer quantization techniques achieved model size compression ratios of up to 16 times over the full precision baseline with no recognition performance degradation. When used to compress a larger full precision Transformer LM with more layers, overall word error rate (WER) reductions of up to 1.7% absolute (18% relative) were obtained.
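As a minimal sketch of the first technique, a Hessian trace weighted sensitivity metric of the kind described here (following HAWQ-style analysis; the notation below is assumed for illustration and may differ from the paper's exact formulation) can be written for layer l under bit-width b as

\Omega_l(b) = \mathrm{Tr}(H_l)\,\lVert Q_b(W_l) - W_l \rVert_2^2

where W_l are the layer-l weights, H_l is the Hessian of the training loss with respect to W_l (its trace estimated, e.g., by Hutchinson sampling), and Q_b(\cdot) is the b-bit quantization operator. The local precision settings b_l can then be selected to minimise the summed sensitivity \sum_l \Omega_l(b_l) subject to an overall model size budget.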