Many factors influence speech, yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence to be produced via sampling. The degree of prosodic variability depends heavily on the prior used when sampling. In this paper, we propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system. In doing so, we aim to sample with greater prosodic variability while gaining control over the structure of the latent space. By using as prior the posterior distribution of a secondary VAE, which we condition on a speaker vector, we can sample from the primary VAE while explicitly taking the conditioning into account, yielding samples from a specific region of the latent space for each condition (i.e. speaker). A formal preference test demonstrates a significant preference for the proposed approach over a standard conditional VAE. We also provide visualisations of the latent space, in which well-separated condition-specific clusters appear, as well as ablation studies to better understand the behaviour of the system.
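To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a speaker-conditional learned prior: the KL term of the primary VAE is computed against the posterior of a secondary encoder conditioned on a speaker vector, rather than against a fixed N(0, I). Only the encoder side of the secondary VAE is shown, and all module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps an input to the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, in_dim, latent_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I)
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=-1,
    )

# Hypothetical dimensions: reference-acoustic features, speaker vector, latent.
ref_dim, spk_dim, latent_dim = 80, 64, 32

primary_enc = GaussianEncoder(ref_dim, latent_dim)    # q(z | reference audio)
secondary_enc = GaussianEncoder(spk_dim, latent_dim)  # p(z | speaker): learned prior

ref_features = torch.randn(4, ref_dim)  # e.g. features of a reference utterance
speaker_vec = torch.randn(4, spk_dim)   # speaker embedding (the conditioning)

# Training: the KL term pulls the primary posterior toward the
# speaker-conditional prior instead of a fixed standard Gaussian.
mu_q, logvar_q = primary_enc(ref_features)
mu_p, logvar_p = secondary_enc(speaker_vec)
z = reparameterize(mu_q, logvar_q)  # fed to the TTS decoder (omitted here)
kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p).mean()

# Inference: sample z directly from the speaker-conditional prior, so each
# speaker draws samples from its own region of the latent space.
with torch.no_grad():
    mu_p, logvar_p = secondary_enc(speaker_vec)
    z_sample = reparameterize(mu_p, logvar_p)
```

Under this reading, the condition-specific clusters reported in the abstract arise because each speaker vector induces its own prior mean and variance, so sampling at inference time is restricted to that speaker's region of the latent space.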