Denoising diffusion probabilistic models (diffusion models for short) require a large number of iterations during inference to reach a generation quality that matches or surpasses state-of-the-art generative models, which invariably results in slow inference. Previous approaches speed up inference by optimizing the choice of inference schedule over a small number of iterations. However, this reduces generation quality, mainly because the inference process is optimized separately rather than jointly with the training process. In this paper, we propose InferGrad, a diffusion model for vocoders that incorporates the inference process into training, to reduce the number of inference iterations while maintaining high generation quality. More specifically, during training, we generate data from random noise through a reverse process under an inference schedule with only a few iterations, and impose a loss that minimizes the gap between the generated and ground-truth data samples. In this way, unlike existing approaches, the training of InferGrad takes the inference process into account. The advantages of InferGrad are demonstrated through experiments on the LJSpeech dataset, which show that InferGrad achieves better voice quality than the baseline WaveGrad under the same conditions, and matches the baseline's voice quality with a $3$x speedup ($2$ iterations for InferGrad vs $6$ iterations for WaveGrad).
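To make the training idea concrete, the following is a minimal sketch of how an "inference-aware" loss could be added to a WaveGrad-style vocoder, assuming a model `model(noisy_audio, mel, noise_level)` that predicts the added noise. The few-step reverse pass, the schedule values, and the simple L1 inference loss below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def reverse_process(model, mel, infer_betas, length):
    """Run the reverse (denoising) process for a few iterations, starting from noise."""
    alphas = 1.0 - infer_betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(mel.size(0), length, device=mel.device)  # start from random noise
    for t in reversed(range(len(infer_betas))):
        beta, alpha, alpha_bar = infer_betas[t], alphas[t], alpha_bars[t]
        noise_level = torch.sqrt(alpha_bar).expand(x.size(0))
        eps = model(x, mel, noise_level)                      # predicted noise
        # DDPM-style posterior mean update
        x = (x - beta / torch.sqrt(1.0 - alpha_bar) * eps) / torch.sqrt(alpha)
        if t > 0:                                             # inject noise except at the last step
            x = x + torch.sqrt(beta) * torch.randn_like(x)
    return x


def infergrad_loss(model, audio, mel, train_noise_level, infer_betas, lam=1.0):
    """Standard diffusion (noise-prediction) loss plus a loss between the
    few-iteration generated waveform and the ground-truth waveform."""
    # 1) usual noise-prediction loss at a random training noise level
    noise = torch.randn_like(audio)
    noisy = train_noise_level * audio + torch.sqrt(1.0 - train_noise_level ** 2) * noise
    diffusion_loss = F.l1_loss(model(noisy, mel, train_noise_level.squeeze(-1)), noise)
    # 2) inference loss: generate audio under the few-iteration inference schedule
    #    and compare with ground truth (L1 here as a stand-in for a spectral loss)
    generated = reverse_process(model, mel, infer_betas, audio.size(-1))
    inference_loss = F.l1_loss(generated, audio)
    return diffusion_loss + lam * inference_loss
```

Because the reverse process is run inside the training graph, gradients flow through the few-step generation, so the model is directly encouraged to produce high-quality samples under the short inference schedule it will actually use at test time.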