We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu$^{2}$SLAM trains the speech-text models with a sequence-to-sequence masked denoising objective similar to T5 on the decoder and a masked language modeling (MLM) objective on the encoder, for both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu$^{2}$SLAM establishes a new state-of-the-art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On Voxpopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6\% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks.
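To make the pre-training objective more concrete, here is a minimal sketch (not the authors' implementation) of T5-style span corruption applied to a sequence of discrete tokens, which in Mu$^{2}$SLAM can be either text subwords or quantized speech units used as decoder targets. The sentinel id range, the 15\% corruption rate, and the mean span length are illustrative assumptions, not values taken from the paper.

\begin{verbatim}
# Sketch of T5-style span corruption on a discrete token sequence.
# Assumed (hypothetical) constants, not from the paper:
import random
from typing import List, Tuple

SENTINEL_START = 32000   # hypothetical id of the first sentinel token
MASK_RATE = 0.15         # assumed fraction of tokens to corrupt
MEAN_SPAN_LEN = 3        # assumed mean length of each masked span

def span_corrupt(tokens: List[int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Replace random spans with sentinels (encoder input) and emit the
    masked spans, each preceded by its sentinel, as the decoder target."""
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * MASK_RATE))
    masked = [False] * len(tokens)
    n_masked = 0
    while n_masked < n_to_mask:
        span_len = max(1, int(rng.expovariate(1.0 / MEAN_SPAN_LEN)))
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + span_len, len(tokens))):
            if not masked[i]:
                masked[i] = True
                n_masked += 1
    enc_input, dec_target = [], []
    sentinel = SENTINEL_START
    i = 0
    while i < len(tokens):
        if masked[i]:
            enc_input.append(sentinel)
            dec_target.append(sentinel)
            while i < len(tokens) and masked[i]:
                dec_target.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            enc_input.append(tokens[i])
            i += 1
    return enc_input, dec_target

if __name__ == "__main__":
    # Pretend these are quantized speech-unit ids produced by a codebook.
    speech_units = [101, 57, 57, 200, 13, 99, 42, 42, 7, 88, 301, 5]
    enc, dec = span_corrupt(speech_units)
    print("encoder input :", enc)
    print("decoder target:", dec)
\end{verbatim}

In this sketch the corrupted sequence is fed to the encoder and the model is trained to reconstruct the masked spans autoregressively on the decoder; the encoder-side MLM objective mentioned above would instead predict the masked positions directly from the encoder outputs.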