The end-to-end formulation of automatic speech recognition (ASR) and speech translation (ST) makes it easy to use a single model for both multilingual ASR and many-to-many ST. In this paper, we propose streaming language-agnostic multilingual speech recognition and translation using neural transducers (LAMASSU). To enable multilingual text generation in LAMASSU, we conduct a systematic comparison between specified and unified prediction and joint networks. We leverage a language-agnostic multilingual encoder that substantially outperforms shared encoders. To enhance LAMASSU, we propose feeding the target language identification (LID) to the encoder. We also apply connectionist temporal classification (CTC) regularization to transducer training. Experimental results show that LAMASSU not only drastically reduces the model size but also outperforms monolingual ASR and bilingual ST models.
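To make the ingredients above concrete, the following is a minimal sketch (not the authors' implementation) of a transducer training step that combines the ideas named in the abstract: a target-LID embedding added to the encoder input, a unified prediction/joint network shared across languages, and an auxiliary CTC loss on the encoder output. It assumes PyTorch and torchaudio; all module names, dimensions, and the loss weight are illustrative choices, not values from the paper.

```python
# Hedged sketch: unified transducer with target-LID input and CTC regularization.
# All names/hyperparameters are hypothetical; only the overall structure follows the abstract.
import torch
import torch.nn as nn
import torchaudio.functional as F_audio


class TinyTransducer(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=100, num_langs=4, hidden=256, blank=0):
        super().__init__()
        self.blank = blank
        # Target-LID embedding, added to every encoder input frame.
        self.lid_emb = nn.Embedding(num_langs, feat_dim)
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # "Unified" variant: one prediction network and one joint network for all languages.
        self.predictor = nn.LSTM(vocab_size, hidden, batch_first=True)
        self.joint = nn.Linear(hidden * 2, vocab_size)
        # Auxiliary CTC branch on the encoder output, used only for regularization.
        self.ctc_head = nn.Linear(hidden, vocab_size)

    def forward(self, feats, lid, targets):
        x = feats + self.lid_emb(lid).unsqueeze(1)                # (B, T, F) + (B, 1, F)
        enc, _ = self.encoder(x)                                  # (B, T, H)
        # Prediction network consumes <blank>, y_1, ..., y_U (one-hot for simplicity).
        y_in = torch.nn.functional.one_hot(
            torch.cat([torch.full_like(targets[:, :1], self.blank), targets], dim=1),
            num_classes=self.joint.out_features).float()
        pred, _ = self.predictor(y_in)                            # (B, U+1, H)
        joint = self.joint(torch.cat([
            enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
            pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1))  # (B, T, U+1, V)
        return enc, joint


def train_step(model, feats, feat_lens, lid, targets, target_lens, ctc_weight=0.3):
    """One training step: transducer loss plus weighted CTC regularization."""
    enc, joint = model(feats, lid, targets)
    rnnt = F_audio.rnnt_loss(joint, targets.int(), feat_lens.int(), target_lens.int(),
                             blank=model.blank)
    ctc_logp = model.ctc_head(enc).log_softmax(-1).transpose(0, 1)        # (T, B, V)
    ctc = nn.functional.ctc_loss(ctc_logp, targets, feat_lens, target_lens,
                                 blank=model.blank, zero_infinity=True)
    return rnnt + ctc_weight * ctc
```

Under this sketch, switching the target LID changes only the embedding added to the acoustic features, so the same model can be driven to transcribe (LID equal to the source language) or translate (LID set to a different target language).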