This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state of the art on inter-speaker and inter-text prosody transfer. This improvement is achieved using FiLM conditioning layers, alongside adversarial training that encourages disentanglement between prosodic information and speaker identity. The acoustic model inherits attractive qualities from FastSpeech 2, such as fast inference and the prediction of local prosody attributes for finer-grained control over generation. Experimental results show that Daft-Exprt significantly outperforms strong baselines on prosody transfer tasks, while yielding naturalness comparable to state-of-the-art expressive models. Moreover, results indicate that adversarial training effectively discards speaker identity information from the prosody representation, which ensures Daft-Exprt will consistently generate speech with the desired voice. We publicly release our code and provide speech samples from our experiments.
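To make the two mechanisms named above concrete, here is a minimal PyTorch sketch of a FiLM conditioning layer and a gradient-reversal adversarial speaker classifier, a common recipe for the kind of disentanglement the abstract describes. This is not the authors' implementation: the module names, tensor shapes, and the `lambd` scaling factor are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's released code): a FiLM layer and a
# gradient-reversal speaker adversary. Names and dimensions are illustrative.
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts hidden features
    with parameters predicted from a conditioning vector."""

    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        # One linear layer predicts both gamma (scale) and beta (shift).
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, hidden: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, hidden_dim); cond: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        # Broadcast the per-utterance modulation over the time axis.
        return gamma.unsqueeze(1) * hidden + beta.unsqueeze(1)


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on backward,
    so the prosody encoder is pushed to *remove* speaker information."""

    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class SpeakerAdversary(nn.Module):
    """Classifies the speaker from the prosody embedding through a
    gradient-reversal layer; trained with ordinary cross-entropy."""

    def __init__(self, prosody_dim: int, n_speakers: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Linear(prosody_dim, n_speakers)

    def forward(self, prosody_emb: torch.Tensor) -> torch.Tensor:
        reversed_emb = GradReverse.apply(prosody_emb, self.lambd)
        return self.classifier(reversed_emb)  # speaker logits
```

Under this sketch, the adversary's cross-entropy loss trains the classifier to predict the speaker while, through the reversed gradient, training the prosody encoder to make that prediction impossible, which is one standard way to realize the disentanglement the abstract claims.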