End-to-end neural network models achieve improved performance on various automatic speech recognition (ASR) tasks. However, these models are difficult to deploy on edge hardware due to their large memory and computation requirements. While quantizing model weights and/or activations to low precision is a promising solution, prior research on quantizing ASR models is limited. Most quantization approaches use floating-point arithmetic during inference, and thus they cannot fully exploit integer processing units, which consume less power than their floating-point counterparts. Moreover, they require training/validation data during quantization for finetuning or calibration; however, such data may not be available due to security or privacy concerns. To address these limitations, we propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models. In particular, we generate synthetic data whose runtime statistics resemble those of the real data, and we use it to calibrate models during quantization. We then apply Q-ASR to quantize QuartzNet-15x5 and JasperDR-10x5 without any training data, and we show negligible WER change compared to the full-precision baseline models. For INT8-only quantization, we observe a modest WER degradation of at most 0.29%, while achieving up to 2.44x speedup on a T4 GPU. Furthermore, Q-ASR achieves a compression rate of more than 4x with small WER degradation.
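To illustrate the kind of statistics-matching data synthesis sketched above, the snippet below optimizes a random input so that its per-layer activation statistics match the BatchNorm running statistics stored in a pretrained model; the resulting synthetic batch can then be used to calibrate activation ranges for quantization. This is a minimal, hypothetical sketch in PyTorch, not the authors' implementation: the function name, optimization settings, and the assumption of BatchNorm1d layers (as in QuartzNet-style 1D-convolutional models) are illustrative choices.

```python
import torch
import torch.nn as nn

def generate_synthetic_batch(model, input_shape, num_steps=500, lr=0.1):
    """Optimize random input so its per-channel mean/variance at each
    BatchNorm1d layer matches that layer's running statistics.
    (Illustrative sketch only; hyperparameters are assumptions.)"""
    model.eval()
    x = torch.randn(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)

    # Hook the input of every BatchNorm1d layer to read its activations.
    bn_layers = [m for m in model.modules() if isinstance(m, nn.BatchNorm1d)]
    activations = {}

    def make_hook(idx):
        def hook(module, inp, out):
            activations[idx] = inp[0]
        return hook

    handles = [bn.register_forward_hook(make_hook(i))
               for i, bn in enumerate(bn_layers)]

    for _ in range(num_steps):
        optimizer.zero_grad()
        model(x)
        loss = x.new_zeros(())
        for i, bn in enumerate(bn_layers):
            act = activations[i]                      # shape (N, C, T)
            mean = act.mean(dim=(0, 2))
            var = act.var(dim=(0, 2), unbiased=False)
            # Penalize mismatch with the stored running statistics.
            loss = loss + ((mean - bn.running_mean) ** 2).mean() \
                        + ((var - bn.running_var) ** 2).mean()
        loss.backward()
        optimizer.step()

    for h in handles:
        h.remove()
    return x.detach()
```

Feeding such a synthetic batch through the model yields activation ranges for choosing quantization scales without touching any real training or validation data.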