We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process. Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.
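As a rough sketch of the joint objective described above (the notation and the relative weight \(\lambda\) are illustrative assumptions, not values reported in the paper), the pre-training loss can be summarized as a sum of the speech, text, and paired-data terms:

\[
\mathcal{L}_{\text{mSLAM}} \;=\; \mathcal{L}_{\text{w2v-BERT}}(\text{speech}) \;+\; \mathcal{L}_{\text{SpanBERT}}(\text{text}) \;+\; \lambda\, \mathcal{L}_{\text{CTC}}(\text{speech},\,\text{transcript}),
\]

where the CTC term on paired speech and character-level transcripts encourages the two modalities to align in the shared representation space.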