Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus. However, little guidance exists on deploying these methods to improve production ASR systems that are trained on very large supervised corpora and that must satisfy realistic requirements such as a constrained model size and CPU budget, streaming capability, and a rich lattice for rescoring and for downstream NLU tasks. In this work, we compare three state-of-the-art semi-supervised methods, encompassing both unpaired text and audio, as well as several of their combinations in a controlled setting using joint training. We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, reduced decoder computation during inference, and increased lattice density.