We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems. The text encoder improves the model's ability to adapt one modality (i.e., source-language speech) to another (i.e., source-language text), so the speech translation model can learn from both unlabeled and labeled data, which is especially useful when source-language text data is abundant. In addition, we present a denoising method to build a robust text encoder that can handle both clean and noisy text data. Our system achieves new state-of-the-art results on the MuST-C En-De, En-Fr, and LibriSpeech En-Fr tasks.
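To make the high-level idea concrete, the sketch below illustrates one way a dual-encoder setup of this kind could be wired up: a speech encoder and a newly introduced text encoder feeding a shared decoder, plus a toy token-corruption function standing in for a denoising objective. This is a minimal illustration under our own assumptions (module names, dimensions, and the noising scheme are illustrative), not the authors' implementation.

```python
# Minimal sketch (assumed, not the paper's code): a speech encoder and a text
# encoder share one decoder, so the decoder can be trained from source-language
# text as well as from speech. All sizes and names are illustrative.
import torch
import torch.nn as nn


class DualEncoderST(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        # Speech path: acoustic features -> Transformer encoder (assumed pre-trained).
        self.speech_proj = nn.Linear(80, d_model)  # e.g. 80-dim filterbank frames
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers
        )
        # Text path: source-language tokens -> newly introduced text encoder.
        self.src_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers
        )
        # Shared decoder: generates target-language tokens from either encoder.
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_tokens, speech=None, src_tokens=None):
        # Encode whichever source modality is provided; both feed the same decoder.
        if speech is not None:
            memory = self.speech_encoder(self.speech_proj(speech))
        else:
            memory = self.text_encoder(self.src_embed(src_tokens))
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        dec = self.decoder(self.tgt_embed(tgt_tokens), memory, tgt_mask=causal)
        return self.out(dec)


def add_token_noise(tokens, replace_prob=0.1, vocab_size=1000):
    # Toy corruption for denoising-style training (an assumption, not the
    # paper's scheme): randomly replace some source tokens so the text encoder
    # also learns to handle noisy input.
    noise = torch.randint_like(tokens, vocab_size)
    mask = torch.rand(tokens.shape) < replace_prob
    return torch.where(mask, noise, tokens)


if __name__ == "__main__":
    model = DualEncoderST()
    speech = torch.randn(2, 50, 80)            # (batch, frames, features)
    src = torch.randint(0, 1000, (2, 12))      # source-language text
    tgt = torch.randint(0, 1000, (2, 10))      # target-language text
    st_logits = model(tgt, speech=speech)                      # speech -> translation
    mt_logits = model(tgt, src_tokens=add_token_noise(src))    # noisy text -> translation
    print(st_logits.shape, mt_logits.shape)    # both: (2, 10, 1000)
```

In such a setup, batches drawn from speech-translation data and from (possibly corrupted) source-text data can be interleaved during training, which is one plausible way to exploit abundant source-language text alongside the labeled speech data, as the abstract describes.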