Generating natural speech with a diverse and smooth prosody pattern is a challenging task. Although random sampling from a phone-level prosody distribution has been investigated as a way to generate different prosody patterns, the diversity of the generated speech is still very limited and far from what humans can achieve. This is largely due to the use of a unimodal distribution, such as a single Gaussian, in prior work on phone-level prosody modelling. In this work, we propose a novel approach that models phone-level prosodies with a GMM-based mixture density network (MDN) and then extend it to multi-speaker TTS using speaker adaptation transforms of the Gaussian means and variances. Furthermore, we show that we can clone the prosodies of a reference speech by sampling prosodies from the Gaussian components that produce the reference prosodies. Our experiments on the LJSpeech and LibriTTS datasets show that the proposed method with a GMM-based MDN not only achieves significantly better diversity than a single Gaussian in both single-speaker and multi-speaker TTS, but also provides better naturalness. The prosody cloning experiments demonstrate that the prosody similarity of the proposed method is comparable to that of a recently proposed fine-grained VAE, while the target speaker similarity is better.
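The two sampling ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a scalar prosody value per phone (real prosody embeddings would be vectors), and the mixture weights, means, and standard deviations are hypothetical constants standing in for what an MDN would predict for each phone. The helper names `sample_gmm`, `responsible_component`, and `clone_prosody` are likewise introduced here for illustration only.

```python
import math
import random


def sample_gmm(weights, means, stds, rng):
    """Draw one value from a 1-D GMM: pick a component by weight, then sample it.

    Returns the sampled value and the index of the chosen component.
    """
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[k], stds[k]), k


def responsible_component(x, weights, means, stds):
    """Return the index of the component with the highest posterior for x."""
    def log_joint(k):
        # log(weight_k) + log N(x; mean_k, std_k), dropping the shared constant.
        z = (x - means[k]) / stds[k]
        return math.log(weights[k]) - math.log(stds[k]) - 0.5 * z * z
    return max(range(len(weights)), key=log_joint)


def clone_prosody(x_ref, weights, means, stds, rng):
    """Sample from the component that best explains the reference prosody.

    This is one simple reading of "sampling from the Gaussian components
    that produce the reference prosodies"; it is an assumption of this sketch.
    """
    k = responsible_component(x_ref, weights, means, stds)
    return rng.gauss(means[k], stds[k]), k


# Hypothetical two-component mixture for a single phone.
weights = [0.5, 0.5]
means = [-2.0, 2.0]
stds = [0.1, 0.1]

rng = random.Random(0)
free_value, free_k = sample_gmm(weights, means, stds, rng)
cloned_value, cloned_k = clone_prosody(1.9, weights, means, stds, rng)
```

With a single Gaussian, every draw clusters around one mean. With a mixture, unconstrained sampling can land in any mode, which is the source of the extra diversity, while restricting the draw to the component selected by the reference keeps the sample close to the reference prosody.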