Direct speech-to-speech translation (S2ST) models suffer from data scarcity, as little parallel S2ST data exists compared with the data available for conventional cascaded systems that combine automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue. We take advantage of a recently proposed speech-to-unit translation (S2UT) framework that encodes target speech into discrete representations, and transfer pre-training and efficient partial fine-tuning techniques that work well for speech-to-text translation (S2T) to the S2UT domain by studying both speech encoder and discrete unit decoder pre-training. Our experiments on Spanish-English translation show that self-supervised pre-training consistently improves model performance compared with multitask learning, with an average 6.6-12.1 BLEU gain, and that it can be further combined with data augmentation techniques that apply MT to create weakly supervised training data. Audio samples are available at: https://facebookresearch.github.io/speech_translation/enhanced_direct_s2st_units/index.html .
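The core idea of the S2UT framework mentioned above is to replace the continuous target waveform with a sequence of discrete unit ids obtained by clustering self-supervised speech features. The following is a minimal sketch of that discretization step, assuming a fixed codebook standing in for learned k-means centroids and random vectors standing in for per-frame speech features; all names here are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 200 frames of 16-dim speech features and a 10-entry codebook
# (in practice, features come from a self-supervised model and the codebook
# from k-means clustering over such features).
features = rng.normal(size=(200, 16))
codebook = rng.normal(size=(10, 16))

# Assign each frame to its nearest codebook entry -> one discrete unit per frame.
dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
units = dists.argmin(axis=1)

# Collapse consecutive repeated units, a common reduction before training
# the unit decoder on the resulting shorter target sequence.
reduced = [int(units[0])]
for u in units[1:]:
    if u != reduced[-1]:
        reduced.append(int(u))

print(len(units), len(reduced))
```

The reduced unit sequence then serves as the discrete translation target; a separately trained unit-to-waveform vocoder maps predicted units back to audio.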