Most modern text-to-speech architectures use a WaveNet vocoder to synthesize high-fidelity waveform audio, but its ancestral sampling scheme limits practical application through high inference time. The recently proposed Parallel WaveNet and ClariNet achieve real-time audio synthesis by incorporating inverse autoregressive flow for parallel sampling. However, these approaches require a two-stage training pipeline with a well-trained teacher network and can produce natural sound only by using probability distillation along with auxiliary loss terms. We propose FloWaveNet, a flow-based generative model for raw audio synthesis. FloWaveNet requires only a single-stage training procedure and a single maximum-likelihood loss, without any additional auxiliary terms, and it is inherently parallel due to the characteristics of generative flow. The model can efficiently sample raw audio in real time, with clarity comparable to previous two-stage parallel models. The code and samples for all models, including our FloWaveNet, are publicly available.
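For readers unfamiliar with flow-based models, the single maximum-likelihood objective mentioned above can be sketched with the standard change-of-variables formula. In the sketch below, f denotes the invertible flow mapping an audio sample x to a latent z, and p_Z is a tractable prior such as a standard Gaussian; these symbols are chosen for illustration and are not taken from the paper's notation:

\[
\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
\]

Sampling then reduces to drawing z \sim p_Z and computing x = f^{-1}(z) in a single forward pass, which is why generation is inherently parallel whenever the inverse transform is non-autoregressive.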