Duration modelling has once again become an important research problem with the rise of non-attention neural text-to-speech systems. Current approaches largely fall back on duration prediction technology from earlier statistical parametric speech synthesis, which poorly models the expressiveness and variability of speech. In this paper, we propose two alternative approaches to improve duration modelling. First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses. We show that this model improves the naturalness of speech over our baseline duration model. Second, we propose a multi-speaker duration model called Cauliflow, which uses normalising flows to predict durations that better match the complex target duration distribution. Cauliflow performs on par with our other proposed duration model in terms of naturalness, whilst providing variable durations for the same prompt and variable levels of expressiveness. Lastly, we propose a novel way of conditioning Cauliflow on parameters that provide intuitive control of the pacing and pausing in the synthesised speech.