Direct speech-to-speech translation (S2ST) is among the most challenging problems in translation because S2ST data is extremely scarce. While efforts have been made to increase the data size from unlabeled speech by cascading pretrained speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) models, unlabeled text has remained relatively under-utilized for improving S2ST. We propose an effective way to use the massive existing unlabeled text from different languages to create a large amount of S2ST data, and we further improve S2ST performance by applying various acoustic effects to the generated synthetic data. Empirically, our method outperforms the state of the art in Spanish-English translation by up to 2 BLEU. The proposed method also yields significant gains in extremely low-resource settings for both Spanish-English and Russian-English translation.
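To make the idea of "applying various acoustic effects to the generated synthetic data" concrete, here is a minimal sketch in plain Python. The abstract does not say which effects are used, so the three shown here (gain change, single-tap echo, additive white noise at a target signal-to-noise ratio) are assumptions chosen for illustration, as are all function names and parameter ranges. A sine wave stands in for the TTS output.

```python
import math
import random


def rms(wave):
    """Root-mean-square amplitude of a waveform given as a list of floats."""
    return math.sqrt(sum(s * s for s in wave) / len(wave))


def change_gain(wave, gain_db):
    """Scale the waveform by a gain expressed in decibels."""
    factor = 10 ** (gain_db / 20)
    return [factor * s for s in wave]


def add_echo(wave, delay, decay):
    """Mix in one delayed, attenuated copy of the signal (single-tap echo)."""
    return [
        s + (decay * wave[i - delay] if i >= delay else 0.0)
        for i, s in enumerate(wave)
    ]


def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise scaled to the requested signal-to-noise ratio."""
    noise = [rng.gauss(0.0, 1.0) for _ in wave]
    scale = rms(wave) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + scale * n for s, n in zip(wave, noise)]


def augment(wave, rng):
    """Apply the three hypothetical effects with randomly drawn parameters."""
    out = change_gain(wave, rng.uniform(-6.0, 6.0))
    out = add_echo(out, delay=rng.randint(100, 800), decay=rng.uniform(0.1, 0.4))
    return add_noise(out, snr_db=rng.uniform(10.0, 30.0), rng=rng)


# Stand-in for one second of synthetic (TTS) speech at 16 kHz.
sample_rate = 16000
tts_wave = [
    0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)
]
augmented = augment(tts_wave, random.Random(0))
```

In a setup like the one the abstract describes, such a function would presumably be applied to the synthetic source-side speech so that the model sees more acoustic variety than clean TTS output alone; whether the authors apply effects to the source, the target, or both is not stated here.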