End-to-end speech-to-text translation models are often initialized with a pre-trained speech encoder and a pre-trained text decoder. This leads to a significant gap between pre-training and fine-tuning, largely due to the modality difference between the speech outputs of the encoder and the text inputs expected by the decoder. In this work, we aim to bridge the modality gap between speech and text to improve translation quality. We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text. While shrinking the speech sequence, M-Adapter produces the features desired for speech-to-text translation by modelling both global and local dependencies of a speech sequence. Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU score on the MuST-C En$\rightarrow$De dataset.\footnote{Our code is available at https://github.com/mingzi151/w2v2-st.}
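To make the idea of an adapter that shrinks the speech sequence concrete, the sketch below shows a minimal PyTorch module sitting between a pre-trained speech encoder and a pre-trained text decoder. It is an illustration under stated assumptions, not the exact M-Adapter: a strided depthwise convolution stands in for the shrinking and local-dependency mechanism, standard multi-head self-attention models the global dependencies, and the module name and hyper-parameters (dimension 768, stride 2) are assumptions chosen for the example.

\begin{verbatim}
import torch
import torch.nn as nn

class LengthShrinkingAdapter(nn.Module):
    """Illustrative adapter that shrinks a speech sequence before
    it is fed to a text decoder. Not the exact M-Adapter: a strided
    depthwise convolution (local context, sequence shrinking) is
    combined with standard self-attention (global context)."""

    def __init__(self, dim: int = 768, num_heads: int = 8,
                 stride: int = 2):
        super().__init__()
        # Strided depthwise conv shrinks the sequence by `stride`
        # while mixing local context within each channel.
        self.pool = nn.Conv1d(dim, dim, kernel_size=3, stride=stride,
                              padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, num_heads,
                                          batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, speech_len, dim) from the speech encoder.
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # global dependencies
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x  # (batch, speech_len // stride, dim), to the decoder

# Usage: 512 speech frames are shrunk to 256 adapted representations.
adapter = LengthShrinkingAdapter()
speech = torch.randn(2, 512, 768)
print(adapter(speech).shape)  # torch.Size([2, 256, 768])
\end{verbatim}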