We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects the three preceding components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pauses. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retaining the source speaker's voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker's voice, making the model more robust for production deployment by mitigating potential misuse for creating spoofed audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input containing speaker turns.
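The component flow described above can be illustrated with a toy dataflow sketch. This is not the real model: the actual components are trained neural networks, and all dimensions, stub "layers", and function names below are illustrative placeholders. The sketch only shows how a single shared attention module over the encoder states feeds both the phoneme decoder and the spectrogram synthesizer.

```python
import numpy as np

# Illustrative stubs only; random matrices stand in for trained layers.
rng = np.random.default_rng(0)
D = 8          # toy hidden size
N_MEL = 80     # mel channels

def speech_encoder(src_mel):               # (T_src, N_MEL) -> (T_src, D)
    return src_mel @ rng.standard_normal((N_MEL, D))

def attention(query, enc_states):          # (D,), (T_src, D) -> (D,) context
    scores = enc_states @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ enc_states

def phoneme_decoder_step(context):         # context -> logits over a toy phoneme set
    return context @ rng.standard_normal((D, 40))

def synthesizer_step(context, dec_state):  # context + decoder state -> one mel frame
    return np.concatenate([context, dec_state]) @ rng.standard_normal((2 * D, N_MEL))

src = rng.standard_normal((100, N_MEL))    # fake source spectrogram, 100 frames
enc = speech_encoder(src)
ctx = attention(rng.standard_normal(D), enc)   # one shared attention pass
logits = phoneme_decoder_step(ctx)             # attention feeds the phoneme decoder...
frame = synthesizer_step(ctx, ctx)             # ...and the synthesizer (ctx reused as a stand-in decoder state)
```

The point of the single attention module is that the synthesizer is conditioned on the same alignment as the phoneme decoder, which is what the abstract credits for reducing over-generation.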
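The concatenation-based augmentation mentioned at the end can be sketched as joining two single-speaker training examples into one example with a speaker turn. The field names and structure below are assumptions for illustration, not the paper's actual data format.

```python
import numpy as np

def concat_augment(sample_a, sample_b):
    """Join two single-speaker examples into one multi-speaker example.

    Each sample is assumed (hypothetically) to be a dict holding a source
    waveform as a 1-D float array and a target transcript string.
    """
    return {
        "src_audio": np.concatenate([sample_a["src_audio"], sample_b["src_audio"]]),
        "tgt_text": sample_a["tgt_text"] + " " + sample_b["tgt_text"],
    }

# toy example: two utterances at a nominal 16 kHz
a = {"src_audio": np.zeros(16000, dtype=np.float32), "tgt_text": "hello"}
b = {"src_audio": np.ones(8000, dtype=np.float32), "tgt_text": "world"}
merged = concat_augment(a, b)
```

Training on such concatenated examples exposes the voice-retention mechanism to inputs where the speaker changes mid-utterance, which is what lets the model preserve each speaker's voice across turns.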