We apply recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition, and push the limits of non-autoregressive state-of-the-art results on multiple datasets: LibriSpeech, Fisher+Switchboard, and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on the LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% WER on Wall Street Journal, all without a language model.
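To make the non-autoregressive claim concrete: with CTC, inference needs no token-by-token generation loop. The model emits a label per frame in one forward pass, and decoding reduces to collapsing repeats and removing blanks. The following is a minimal sketch of CTC greedy decoding, assuming an illustrative blank id and token ids not taken from the paper:

```python
# Minimal sketch of CTC greedy decoding (the non-autoregressive step).
# BLANK and the token ids below are illustrative assumptions, not values
# from the paper.
BLANK = 0

def ctc_greedy_decode(frame_ids):
    """Collapse consecutive repeated labels, then drop the blank symbol.

    frame_ids: per-frame argmax of the encoder's output distribution,
    computed independently for every frame (hence non-autoregressive).
    """
    out = []
    prev = None
    for t in frame_ids:
        # Keep a label only when it differs from the previous frame's
        # label (collapsing repeats) and is not the blank symbol.
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

# Example: per-frame argmax [0, 7, 7, 0, 7, 3, 3] decodes to [7, 7, 3]
# (the blank between the two 7s keeps them as separate tokens).
```

Because every frame's label is chosen independently, the whole decode is a single parallel pass, in contrast to autoregressive attention decoders that condition each output token on the previous ones.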