We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another and can be built without the need for any text data. Different from existing work in the literature, we tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain an average 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech targets. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.
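For concreteness, the speech normalization step described above can be read as CTC finetuning of a pretrained speech encoder to predict the discrete units of a single reference speaker from multi-speaker audio. Below is a minimal PyTorch sketch under that reading; the encoder interface, class and function names, and the choice of blank index are our assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of unit-based speech normalization via CTC finetuning.
# Assumptions: a pretrained HuBERT-style `encoder` returning frame features
# of shape (batch, frames, hidden_dim); a discrete unit vocabulary of size
# `num_units`; paired (multi-speaker waveform, reference-speaker units) data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechNormalizer(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_units: int):
        super().__init__()
        self.encoder = encoder                # pretrained, finetuned end-to-end
        # +1 output class for the CTC blank symbol (index 0 here by assumption)
        self.head = nn.Linear(hidden_dim, num_units + 1)

    def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(waveforms)       # (batch, frames, hidden_dim)
        return self.head(feats)               # (batch, frames, num_units + 1)

def ctc_step(model: SpeechNormalizer,
             waveforms: torch.Tensor,
             unit_targets: torch.Tensor,
             input_lens: torch.Tensor,
             target_lens: torch.Tensor) -> float:
    """One training step: predict the reference speaker's unit sequence from
    multi-speaker audio, so accent and speaker variation is normalized away
    while the lexical content (the units) is preserved."""
    logits = model(waveforms)
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)  # (T, B, C)
    loss = F.ctc_loss(log_probs, unit_targets, input_lens, target_lens,
                      blank=0, zero_infinity=True)
    loss.backward()
    return loss.item()
```

At inference, greedy decoding of the CTC outputs (collapsing repeats and removing blanks) would yield normalized target-unit sequences for S2ST training; with only 10 minutes of paired data, the finetuning loop above would run for correspondingly few steps.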