We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4-bit integer representation for both weights and activations and apply Quantization-Aware Training (QAT) to retrain the full model (acoustic encoder and language model), achieving near-iso-accuracy. We show that customized quantization schemes tailored to the local properties of the network are essential to achieve good performance while limiting the computational overhead of QAT. Density-ratio language model (LM) fusion has shown remarkable accuracy gains on RNN-T workloads, but it severely increases the computational cost of inference. We show that our quantization strategies enable using large beam widths for hypothesis search while achieving streaming-compatible runtimes and a full-model compression ratio of 7.6$\times$ relative to the full-precision model. Via hardware simulations, we estimate a 3.4$\times$ acceleration from FP16 to INT4 for the end-to-end quantized RNN-T inclusive of LM fusion, resulting in a Real Time Factor (RTF) of 0.06. On the NIST Hub5 2000, Hub5 2001, and RT-03 test sets, we retain most of the gains associated with LM fusion, improving the average WER by $>$1.5%.
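For reference, a generic symmetric uniform 4-bit quantizer (a common formulation given here for context; the customized per-layer schemes in this work may use different clipping ranges or scale estimators) maps a real-valued tensor element $x$ with scale $s$ to
\begin{equation*}
q = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right),\, -8,\, 7\right), \qquad \hat{x} = s \cdot q,
\end{equation*}
where $[-8, 7]$ is the signed INT4 range; in QAT the dequantized value $\hat{x}$ is used in the forward pass while gradients are passed through the rounding operation via a straight-through estimator.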