Research on speech-to-speech translation (S2ST) has progressed rapidly in recent years. Many end-to-end systems have been proposed and show advantages over conventional cascade systems, which are typically composed of recognition, translation, and synthesis sub-systems. However, most end-to-end systems still rely on intermediate textual supervision during training, which makes them inapplicable to languages without written forms. In this work, we propose Textless Translatotron, a novel model based on Translatotron 2, for training an end-to-end direct S2ST model without any textual supervision. Instead of jointly training with an auxiliary task that predicts target phonemes, as in Translatotron 2, the proposed model uses an auxiliary task that predicts discrete speech representations obtained from learned or random speech quantizers. When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on par with Translatotron 2 on the multilingual CVSS-C corpus as well as the bilingual Fisher Spanish-English corpus. On the latter, it outperforms the prior state-of-the-art textless model by +18.5 BLEU.
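To illustrate what "discrete speech representations obtained from random speech quantizers" could look like in practice, the following is a minimal sketch and not the authors' implementation. It assumes a random-projection quantizer: each speech frame is passed through a frozen random projection and mapped to the index of the nearest entry in a frozen random codebook. The abstract does not specify the quantizer design or the auxiliary loss, so the class name, the dimensions, and the per-frame cross-entropy loss below are all hypothetical choices made for illustration.

```python
import math
import random


def _unit(v):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


class RandomProjectionQuantizer:
    """Hypothetical quantizer: frozen random projection + frozen random codebook."""

    def __init__(self, input_dim, code_dim, num_codes, seed=0):
        rng = random.Random(seed)
        # Projection matrix of shape (code_dim, input_dim); never trained.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
            for _ in range(code_dim)
        ]
        # Codebook of shape (num_codes, code_dim); rows have unit length.
        self.codebook = [
            _unit([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
            for _ in range(num_codes)
        ]

    def quantize_frame(self, frame):
        """Return the index of the codebook entry nearest to the projected frame."""
        projected = _unit(
            [sum(w * x for w, x in zip(row, frame)) for row in self.projection]
        )
        # For unit vectors, the largest dot product is the smallest L2 distance.
        scores = [
            sum(c * p for c, p in zip(code, projected)) for code in self.codebook
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def quantize(self, frames):
        """Map a sequence of frames to a sequence of discrete unit indices."""
        return [self.quantize_frame(f) for f in frames]


def auxiliary_loss(log_probs, targets):
    """Assumed auxiliary objective: mean negative log-likelihood per frame.

    log_probs[t][k] is the model's log-probability of unit k at frame t;
    targets[t] is the unit index produced by the quantizer for frame t.
    """
    return -sum(lp[t] for lp, t in zip(log_probs, targets)) / len(targets)


# Example: quantize five random 80-dimensional frames into units in [0, 32).
quantizer = RandomProjectionQuantizer(input_dim=80, code_dim=16, num_codes=32)
frame_rng = random.Random(1)
frames = [[frame_rng.gauss(0.0, 1.0) for _ in range(80)] for _ in range(5)]
units = quantizer.quantize(frames)
```

In this sketch the unit sequence plays the role that target phonemes play in Translatotron 2: it serves as the prediction target for the auxiliary task, but it is derived from speech alone, so no transcripts are needed. A learned quantizer would replace the frozen random projection and codebook with trained ones while keeping the same interface.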