We propose SAMU-XLSR: a Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous work on speech representation learning, which learns multilingual contextual speech embeddings at the resolution of an acoustic frame (10-20 ms), this work focuses on learning multimodal (speech-text) multilingual speech embeddings at the resolution of a sentence (5-10 s), such that the embedding vector space is semantically aligned across different languages. We combine the state-of-the-art multilingual frame-level speech representation model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create SAMU-XLSR, an utterance-level multimodal multilingual speech encoder. Although we train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual speech-text and speech-speech associations emerge in its learned representation space. To substantiate this claim, we use the SAMU-XLSR speech encoder in combination with the pre-trained LaBSE text sentence encoder for cross-lingual speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual speech-to-speech translation retrieval. We demonstrate these applications on several cross-lingual text and speech translation retrieval tasks across multiple datasets.
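As a rough illustration of the retrieval setup described above (a minimal sketch, not the paper's implementation), the key property is that once both encoders emit L2-normalized vectors in the same semantically aligned space, cross-lingual speech-to-text retrieval reduces to nearest-neighbour search under cosine similarity. The random arrays below are hypothetical stand-ins for encoder outputs; in practice, the speech embeddings would come from SAMU-XLSR and the text embeddings from the pre-trained LaBSE encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 768  # assumed sentence-embedding dimensionality

def l2_normalize(x):
    """Project vectors onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical stand-ins: a database of target-language text embeddings
# (LaBSE in practice) and a batch of source-language speech-utterance
# embeddings (SAMU-XLSR in practice).
text_embs = l2_normalize(rng.normal(size=(1000, dim)))
speech_embs = l2_normalize(rng.normal(size=(8, dim)))

# Because the two spaces are semantically aligned, retrieval is a
# cosine-similarity argmax over the text database for each utterance.
scores = speech_embs @ text_embs.T   # shape (8, 1000)
retrieved = scores.argmax(axis=1)    # index of best translation candidate
print(retrieved)
```

Speech-to-speech translation retrieval follows the same pattern with SAMU-XLSR embeddings on both sides of the similarity matrix.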