Text-to-speech (TTS) synthesis is the task of converting text into speech. Two factors that have driven progress in TTS are advances in probabilistic models and in latent representation learning. We propose a TTS method based on latent variable conversion using a diffusion probabilistic model and a variational autoencoder (VAE). Our method uses a VAE-based waveform model, a diffusion model that predicts the distribution of the waveform model's latent variables from text, and an alignment model that learns alignments between the text and speech latent sequences. The method integrates diffusion with the VAE by modeling both the mean and variance parameters with diffusion, where the target distribution is determined by the approximation from the VAE. This latent variable conversion framework potentially enables us to flexibly incorporate various latent feature extractors. Our experiments show that our method is robust to linguistic labels with poor orthography and to alignment errors.
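The central idea, a diffusion model that predicts both the mean and variance of the VAE's latent distribution from text, with the target distribution approximated by the VAE posterior, can be illustrated with a rough sketch. The following is a minimal, hypothetical PyTorch outline, not the authors' implementation: module names (WaveVAE, LatentDiffusion), dimensions, the single-step noising, and the Gaussian-matching loss are all illustrative assumptions; the alignment model is omitted and text features are assumed already aligned to frames.

```python
# Hypothetical sketch of diffusion-based latent variable conversion (not the paper's code).
import torch
import torch.nn as nn

class WaveVAE(nn.Module):
    """Toy waveform VAE; its posterior supplies the target latent distribution."""
    def __init__(self, frame_dim=256, latent_dim=64):
        super().__init__()
        self.enc = nn.Linear(frame_dim, 2 * latent_dim)   # -> mean, log-variance
        self.dec = nn.Linear(latent_dim, frame_dim)

    def posterior(self, frames):
        mu, logvar = self.enc(frames).chunk(2, dim=-1)
        return mu, logvar

class LatentDiffusion(nn.Module):
    """Toy denoiser: predicts mean and variance of VAE latents from text features."""
    def __init__(self, text_dim=128, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + latent_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),               # -> predicted mean, log-variance
        )

    def forward(self, noisy_latent, t, text_feat):
        h = torch.cat([noisy_latent, text_feat, t], dim=-1)
        return self.net(h).chunk(2, dim=-1)

# One illustrative training step (single diffusion step, crude noising for brevity).
vae, denoiser = WaveVAE(), LatentDiffusion()
frames = torch.randn(8, 100, 256)            # batch of waveform frames
text_feat = torch.randn(8, 100, 128)         # text features, assumed frame-aligned
mu_q, logvar_q = vae.posterior(frames)       # target distribution approximated by the VAE
t = torch.rand(8, 100, 1)                    # diffusion timestep
noisy = mu_q + torch.randn_like(mu_q) * t    # noised latents
mu_p, logvar_p = denoiser(noisy, t, text_feat)
# Match the predicted mean/variance to the VAE posterior (KL between diagonal Gaussians).
kl = 0.5 * (logvar_q - logvar_p
            + (logvar_p.exp() + (mu_p - mu_q) ** 2) / logvar_q.exp() - 1).mean()
kl.backward()
```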