In this paper, we propose a three-stage training methodology to improve speech recognition accuracy for low-resource languages. We explore and propose an effective combination of techniques: transfer learning, encoder freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised Learning (SSL). To improve the accuracy of a low-resource Italian ASR system, we leverage a well-trained English model, an unlabeled text corpus, and an unlabeled audio corpus using transfer learning, TTS augmentation, and SSL, respectively. In the first stage, we use transfer learning from a well-trained English model, which primarily helps the model learn acoustic information from a resource-rich language. This stage achieves around 24% relative Word Error Rate (WER) reduction over the baseline. In stage two, we utilize unlabeled text data via TTS data augmentation to incorporate language information into the model. We also explore freezing the acoustic encoder at this stage. TTS data augmentation further reduces the WER by ~21% relative. Finally, in stage three, we reduce the WER by another 4% relative by using SSL on unlabeled audio data. Overall, our two-pass speech recognition system, with Monotonic Chunkwise Attention (MoChA) in the first pass and full attention in the second pass, achieves a WER reduction of ~42% relative to the baseline.
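The per-stage figures and the overall figure can be reconciled with a short calculation. The sketch below assumes that each stage's relative reduction is measured against the WER of the preceding stage (suggested by the wording "further" and "another", but not stated explicitly), and it uses the rounded percentages quoted above. The function name is illustrative and does not come from the paper.

```python
def compound_relative_reduction(stage_reductions):
    """Combine successive relative reductions into one overall relative reduction.

    Assumption: each value is relative to the error rate left by the
    previous stage, so the remaining fractions multiply.
    """
    remaining = 1.0
    for r in stage_reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining


# Rounded per-stage relative WER reductions quoted in the abstract:
# transfer learning, TTS augmentation, SSL.
stage_reductions = [0.24, 0.21, 0.04]
overall = compound_relative_reduction(stage_reductions)
print(f"overall relative WER reduction: {overall:.1%}")  # prints 42.4%
```

Under this assumption the three stages compound to roughly 42%, which matches the overall figure reported. Simply adding the percentages (49%) would overstate the gain.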