Direct speech-to-speech translation (S2ST) has attracted growing attention recently. The task is challenging because of data scarcity and the complexity of the speech-to-speech mapping. In this paper, we report our recent progress in S2ST. First, we build an S2ST Transformer baseline that outperforms the original Translatotron. Second, we use external data through pseudo-labeling and obtain a new state-of-the-art result on the Fisher English-to-Spanish test set. Specifically, we exploit the pseudo-labeled data with a combination of popular techniques that are non-trivial to apply to S2ST. Moreover, we evaluate our approach on both a syntactically similar language pair (Spanish-English) and a distant one (English-Chinese). Our implementation is available at https://github.com/fengpeng-yue/speech-to-speech-translation.
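The general idea of pseudo-labeling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Pair` record, the `teacher` callable, and the simple concatenation used for mixing are all hypothetical names and choices introduced here for illustration only. The abstract does not say how the pseudo targets are produced or how they are combined with the real data.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Pair:
    source: str      # identifier of a source-language utterance
    target: str      # identifier of the paired target-language speech
    is_pseudo: bool  # True if the target was generated rather than recorded


def pseudo_label(
    unlabeled_sources: Sequence[str],
    teacher: Callable[[str], str],
) -> List[Pair]:
    """Generate a pseudo target for every unlabeled source utterance.

    `teacher` is a hypothetical stand-in for whatever system produces
    the target side; its form is an assumption of this sketch.
    """
    return [Pair(src, teacher(src), True) for src in unlabeled_sources]


def build_training_set(real: Sequence[Pair], pseudo: Sequence[Pair]) -> List[Pair]:
    """Combine real and pseudo-labeled pairs by plain concatenation.

    Concatenation is only one possible mixing strategy, assumed here.
    """
    return list(real) + list(pseudo)


def toy_teacher(src: str) -> str:
    """Toy teacher that only tags its input, so the sketch runs end to end."""
    return f"pseudo_target({src})"


real_pairs = [Pair("utt_real_0", "tgt_real_0", False)]
pseudo_pairs = pseudo_label(["utt_ext_0", "utt_ext_1"], toy_teacher)
train_set = build_training_set(real_pairs, pseudo_pairs)
```

Keeping the `is_pseudo` flag on each pair is a design choice of this sketch. It leaves room to treat generated and recorded targets differently later on, for example when filtering or weighting examples.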