Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.
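The cross-modal interface described above can be sketched in a few lines: a shared codebook of discrete latent units, nearest-neighbor quantization of hidden states, and a random mix-up that swaps a fraction of encoder states for their quantized units. This is an illustrative sketch, not the paper's implementation; the codebook size, hidden dimension, and mix-up probability are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only.
num_units, dim = 100, 16                     # codebook entries, hidden dimension
codebook = rng.standard_normal((num_units, dim))

def quantize(states):
    """Map each hidden state to its nearest codebook entry (L2 distance)."""
    dists = ((states[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[dists.argmin(axis=1)]

def mix_up(states, p=0.15):
    """Randomly replace a fraction p of states with their quantized latent units."""
    mask = rng.random(len(states)) < p
    mixed = states.copy()
    mixed[mask] = quantize(states[mask])
    return mixed

# A toy batch standing in for encoder output (speech or text modality);
# the mixed sequence is what the shared decoder would consume.
speech_states = rng.standard_normal((50, dim))
mixed = mix_up(speech_states)
```

Because both modalities are quantized against the same codebook, the discrete units act as a shared semantic vocabulary, which is the intuition behind aligning speech and text states in one space.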