This paper describes the THUEE team's speech recognition system for the IARPA Open Automatic Speech Recognition Challenge (OpenASR21), along with further experimental explorations. We achieve outstanding results under both the Constrained and Constrained-plus training conditions. For the Constrained training condition, we construct our basic ASR system on the standard hybrid architecture. To alleviate the Out-Of-Vocabulary (OOV) problem, we extend the pronunciation lexicon using Grapheme-to-Phoneme (G2P) techniques for both OOV and potential new words. Standard acoustic model structures such as CNN-TDNN-F and CNN-TDNN-F-A are adopted. In addition, multiple data augmentation techniques are applied. For the Constrained-plus training condition, we use the self-supervised learning framework wav2vec2.0. We experiment with various fine-tuning techniques under the Connectionist Temporal Classification (CTC) criterion on top of the publicly available pre-trained model XLSR-53. We find that the frontend feature extractor plays an important role when applying the wav2vec2.0 pre-trained model to the encoder-decoder based CTC/Attention ASR architecture. Extra improvements can be achieved by using the CTC model fine-tuned on the target language as the frontend feature extractor.