Multilingual speech recognition with supervised learning has achieved strong results, as reflected in recent research. With the development of pretraining methods on audio and text data, it has become imperative to transfer knowledge from unsupervised multilingual models to facilitate recognition, especially for the many languages with limited data. Our work investigates the effectiveness of combining two pretrained models, one per modality: wav2vec 2.0 for audio and MBART50 for text, together with adaptive weight techniques, to substantially improve recognition quality on public datasets including CommonVoice and Europarl. Overall, we observe a 44% improvement over purely supervised learning, and, more importantly, each technique provides different gains in different languages. We also explore further ways to improve the model by slightly increasing its depth or adding relative attention to the architecture.
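To make the setup concrete, the following is a minimal sketch of coupling a pretrained audio encoder with a pretrained text decoder, assuming the Hugging Face transformers library and its publicly released checkpoints (facebook/wav2vec2-large-xlsr-53 and facebook/mbart-large-50). The coupling via SpeechEncoderDecoderModel and the token-id glue are illustrative assumptions, not the authors' exact setup, and the adaptive weight techniques are not shown.

```python
import torch
from transformers import SpeechEncoderDecoderModel

# Couple a pretrained wav2vec 2.0 audio encoder with a pretrained
# mBART50 text decoder; the cross-attention connecting them is newly
# initialized and learned during supervised fine-tuning.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # multilingual wav2vec 2.0
    "facebook/mbart-large-50",          # multilingual mBART50
)

# Config glue needed so labels can be shifted into decoder inputs;
# the ids follow mBART conventions (</s> = 2, <pad> = 1).
model.config.decoder_start_token_id = 2
model.config.pad_token_id = 1

# Forward pass on a dummy one-second, 16 kHz waveform with placeholder
# label ids, just to confirm the wiring produces a training loss.
waveform = torch.randn(1, 16000)
labels = torch.tensor([[250004, 0, 2]])  # placeholder token ids
loss = model(inputs=waveform, labels=labels).loss
```

In this configuration, the encoder and decoder start from their pretrained weights and only the cross-attention is trained from scratch, which is one common way to reuse unsupervised multilingual models for supervised recognition.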