Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders. DDPMs and GANs are characterized by an iterative denoising framework and by adversarial training, respectively. This study proposes a fast and high-quality neural vocoder called \textit{WaveFit}, which integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration. WaveFit iteratively denoises an input signal and trains a deep neural network (DNN) to minimize an adversarial loss computed from the intermediate outputs of all iterations. Subjective (side-by-side) listening tests showed no statistically significant differences in naturalness between human natural speech and speech synthesized by WaveFit with five iterations. Furthermore, the inference speed of WaveFit was more than 240 times faster than that of WaveRNN. Audio demos are available at \url{google.github.io/df-conformer/wavefit/}.
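The following is a minimal, hypothetical Python (PyTorch) sketch of the core idea stated above: a shared denoising DNN is applied for a fixed number of iterations, and an adversarial (plus auxiliary) loss is accumulated over the intermediate outputs of every iteration. The module names (\texttt{Denoiser}, \texttt{discriminator}), the toy architecture, and the specific loss terms are placeholders for illustration only; they are not the paper's actual model or objective.

\begin{verbatim}
# Hypothetical sketch of a WaveFit-style training step.
# Assumptions: T = 5 iterations, a toy Denoiser, and a `discriminator`
# module returning realness scores; none of these are the paper's
# actual architecture or loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 5  # number of denoising iterations

class Denoiser(nn.Module):
    """Toy stand-in for the shared iterative denoising DNN."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, z, cond):
        # One fixed-point-style refinement step:
        # estimate a residual from the current signal and the
        # conditioning features, then subtract it.
        return z - self.net(z + cond)

def training_step(denoiser, discriminator, opt, z0, cond, target):
    """Accumulate the loss over the intermediate outputs of all T steps."""
    z = z0
    loss = 0.0
    for _ in range(T):
        z = denoiser(z, cond)               # iterative denoising
        adv = -discriminator(z).mean()      # generator-side adversarial term
        rec = F.l1_loss(z, target)          # auxiliary reconstruction term
        loss = loss + adv + rec             # summed over every iteration
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)
\end{verbatim}

The point of the sketch is the training signal: unlike a DDPM trained per noise level, every intermediate output contributes to the loss, so each iteration is pushed toward natural-sounding speech rather than toward a prescribed noisy target.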