Recently, representation learning for text and speech has successfully improved many language-related tasks. However, all existing methods suffer from two limitations: (a) they only learn from one input modality, while a unified representation for both speech and text is needed by tasks such as end-to-end speech translation, and as a result, (b) they cannot exploit various large-scale text and speech data, so their performance is limited by the scarcity of parallel speech translation data. To address these problems, we propose a Fused Acoustic and Text Masked Language Model (FAT-MLM) which jointly learns a unified representation for both acoustic and text input from various types of corpora, including parallel data for speech recognition and machine translation, and even pure speech and text data. Within this cross-modal representation learning framework, we further present an end-to-end model for Fused Acoustic and Text Speech Translation (FAT-ST). Experiments on three translation directions show that by fine-tuning from FAT-MLM, our proposed speech translation models substantially improve translation quality by up to +5.9 BLEU.
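To make the fused-representation idea concrete, the following is a minimal sketch (not the authors' code) of a masked language model over concatenated acoustic frames and text tokens: both modalities are projected into a shared space, a fraction of positions in each modality is masked, and a single Transformer encoder is trained to recover the masked text tokens and acoustic frames. All class names, hyper-parameters, and loss choices here are illustrative assumptions; positional encodings and span masking are omitted for brevity.

```python
# Illustrative sketch of a fused acoustic + text masked LM (assumed design,
# not the paper's implementation).
import torch
import torch.nn as nn


class FusedAcousticTextMLM(nn.Module):
    def __init__(self, vocab_size=10000, feat_dim=80, d_model=256,
                 nhead=4, num_layers=6, mask_prob=0.15):
        super().__init__()
        self.mask_prob = mask_prob
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.speech_proj = nn.Linear(feat_dim, d_model)    # acoustic frames -> shared space
        self.modality_emb = nn.Embedding(2, d_model)       # 0 = speech, 1 = text
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.text_head = nn.Linear(d_model, vocab_size)    # predict masked tokens
        self.speech_head = nn.Linear(d_model, feat_dim)    # reconstruct masked frames

    def forward(self, speech_feats, text_ids):
        # speech_feats: (B, T_s, feat_dim); text_ids: (B, T_t)
        B, T_s, _ = speech_feats.shape
        T_t = text_ids.size(1)

        # Randomly choose positions to mask in each modality.
        speech_mask = torch.rand(B, T_s, device=speech_feats.device) < self.mask_prob
        text_mask = torch.rand(B, T_t, device=text_ids.device) < self.mask_prob
        masked_feats = speech_feats.masked_fill(speech_mask.unsqueeze(-1), 0.0)
        masked_ids = text_ids.masked_fill(text_mask, 0)    # assume id 0 is [MASK]

        # Embed both modalities and encode them jointly with one encoder.
        s = self.speech_proj(masked_feats) + self.modality_emb.weight[0]
        t = self.tok_emb(masked_ids) + self.modality_emb.weight[1]
        h = self.encoder(torch.cat([s, t], dim=1))
        h_s, h_t = h[:, :T_s], h[:, T_s:]

        # MLM loss on masked text tokens + reconstruction loss on masked frames.
        text_loss = nn.functional.cross_entropy(
            self.text_head(h_t)[text_mask], text_ids[text_mask])
        speech_loss = nn.functional.l1_loss(
            self.speech_head(h_s)[speech_mask], speech_feats[speech_mask])
        return text_loss + speech_loss
```

Because the encoder consumes speech-only, text-only, or paired inputs in the same way (simply omitting one segment when it is unavailable), a model of this shape can in principle be pre-trained on ASR, MT, and unpaired speech or text corpora, and its encoder then fine-tuned inside an end-to-end speech translation model.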