We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a na\"ive quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations. In particular, we customize quantization schemes depending on the local properties of the network, improving recognition performance while limiting computational time. We demonstrate our solution on the Switchboard (SWB) and CallHome (CH) test sets of the NIST Hub5-2000 evaluation. DBLSTM-HMMs trained with 300 or 2000 hours of SWB data achieve $<$0.5% and $<$1% average WER degradation, respectively. On the more challenging RNN-T models, our quantization strategy limits the WER degradation of 4-bit inference to 1.3%.
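For concreteness, a na\"ive per-tensor symmetric quantizer of the kind that incurs the large WER degradation noted above can be sketched as follows; the max-absolute-value scaling is an illustrative assumption, not the customized per-layer scheme proposed in this work:
\begin{equation*}
\hat{x} \;=\; s \cdot \operatorname{clip}\!\left(\operatorname{round}\!\left(\tfrac{x}{s}\right),\, -(2^{b-1}-1),\, 2^{b-1}-1\right),
\qquad
s \;=\; \frac{\max_i |x_i|}{2^{b-1}-1},
\qquad b = 4,
\end{equation*}
where $x$ denotes a weight or activation tensor and $\hat{x}$ its simulated 4-bit reconstruction.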