Multilingual Automatic Speech Recognition (ASR) models have extended the usability of speech technologies to a wide variety of languages. However, given the large number of languages these models must handle, a key to understanding their imbalanced performance across languages is to examine whether the model actually knows which language it should transcribe. In this paper, we introduce our work on improving performance on FLEURS, a 102-language open ASR benchmark, by conditioning the entire model on language identity (LID). We investigate techniques inspired by recent Connectionist Temporal Classification (CTC) studies to help the model handle the large number of languages, conditioning on the LID predictions of auxiliary tasks. Our experimental results demonstrate the effectiveness of our technique over standard CTC/Attention-based hybrid models. Furthermore, our state-of-the-art systems, which combine self-supervised models with the Conformer architecture, improve over the results of prior work on FLEURS by a relative 28.4% CER. Trained models and reproducible recipes are available at https://github.com/espnet/espnet/tree/master/egs2/fleurs/asr1.