For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is ubiquitous for balancing model predictive performance against efficiency. A major drawback of existing QAT methods is that the quantization centroids must be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids in a mu-Law constrained space, resulting in a simpler yet more versatile quantization scheme, called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ compresses both RNN-T and Conformer to sub-8-bit precision, and some RNN-T layers to 1-bit, for fast and accurate inference. In physical-device benchmarking, we observe a 30.73% memory footprint saving and a 31.75% user-perceived latency reduction compared to 8-bit QAT.
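The abstract does not detail the quantizer itself, but the core idea it names can be illustrated. Below is a minimal sketch, not the authors' implementation, of soft-to-hard quantization with learnable ("self-adjustable") centroids in a mu-Law companded space; the module name, the mu value, the temperature argument, and the straight-through hard pass are all illustrative assumptions.

```python
# Minimal sketch of soft-to-hard quantization with learnable centroids
# in a mu-Law constrained space. Illustrative only; names and schedules
# are assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn


def mu_law_compand(x, mu=255.0):
    # Map values into the mu-Law companded space, roughly [-1, 1].
    return torch.sign(x) * torch.log1p(mu * x.abs()) / torch.log1p(torch.tensor(mu))


def mu_law_expand(y, mu=255.0):
    # Inverse mapping back to the original value space.
    return torch.sign(y) * ((1.0 + mu) ** y.abs() - 1.0) / mu


class SoftToHardQuantizer(nn.Module):
    def __init__(self, num_centroids=16, mu=255.0):
        super().__init__()
        self.mu = mu
        # Learnable centroids, initialized uniformly in the companded space;
        # these are the "self-adjustable" centroids, trained jointly with the model.
        self.centroids = nn.Parameter(torch.linspace(-1.0, 1.0, num_centroids))

    def forward(self, w, temperature=1.0):
        y = mu_law_compand(w, self.mu)                    # compand weights
        d = (y.unsqueeze(-1) - self.centroids) ** 2       # squared distance to each centroid
        soft = torch.softmax(-d / temperature, dim=-1)    # soft (differentiable) assignment
        y_soft = (soft * self.centroids).sum(-1)          # soft quantized value
        # Hard assignment with a straight-through gradient, so inference
        # sees true discrete centroids while training stays differentiable.
        y_hard = self.centroids[d.argmin(-1)]
        y_q = y_soft + (y_hard - y_soft).detach()
        return mu_law_expand(y_q, self.mu)
```

In this sketch, lowering the temperature over training anneals the soft assignment toward the hard one, e.g. `SoftToHardQuantizer()(torch.randn(256, 256), temperature=0.1)`; with 16 centroids the quantized weights need only 4 bits per value, consistent with the sub-8-bit regime described above.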