Speech-to-speech translation converts a speech utterance in one language directly into a speech utterance in another, and has great potential in tasks such as simultaneous interpretation. State-of-the-art models usually contain an auxiliary module for phoneme sequence prediction, which requires textual annotation of the training dataset. We propose a direct speech-to-speech translation model that can be trained without any textual annotation or content information. Instead of introducing an auxiliary phoneme prediction task, we use bottleneck features as intermediate training objectives to ensure the translation performance of the system. Experiments on Mandarin-Cantonese speech translation demonstrate the feasibility of the proposed approach, and its performance can match that of a cascaded system in terms of translation and synthesis quality.
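As an illustration of what an intermediate training objective on bottleneck features could look like, the sketch below combines a reconstruction loss on the predicted target-language spectrogram with an auxiliary regression loss on predicted bottleneck features. This is a minimal sketch under stated assumptions, not the model described above: the abstract does not specify the loss functions, so the choice of L1 and mean squared error, the `bnf_weight` parameter, and all function names are hypothetical.

```python
from typing import List

Frames = List[List[float]]  # a sequence of feature frames (time x dimension)


def mean_abs_error(pred: Frames, target: Frames) -> float:
    """Average absolute difference over all frames and dimensions."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    return total / count


def mean_squared_error(pred: Frames, target: Frames) -> float:
    """Average squared difference over all frames and dimensions."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def combined_loss(
    pred_spec: Frames,
    target_spec: Frames,
    pred_bnf: Frames,
    target_bnf: Frames,
    bnf_weight: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Spectrogram loss plus a weighted auxiliary bottleneck-feature loss.

    The bottleneck-feature term takes the place of a phoneme prediction
    loss, so no textual annotation is needed to compute it.
    """
    spec_loss = mean_abs_error(pred_spec, target_spec)
    bnf_loss = mean_squared_error(pred_bnf, target_bnf)
    return spec_loss + bnf_weight * bnf_loss
```

In a setup like this, the bottleneck feature targets would be extracted from the training audio in advance, and `bnf_weight` would be tuned on held-out data.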