Denoising diffusion probabilistic models have recently been proposed to generate high-quality samples by estimating the gradient of the data density. The framework takes a standard Gaussian distribution as the prior noise, whereas the corresponding data distribution may be considerably more complex. This discrepancy between the data and the prior potentially makes it inefficient to denoise the prior noise into a data sample. In this paper, we propose PriorGrad, which improves the efficiency of conditional diffusion models for speech synthesis (for example, a vocoder conditioned on a mel-spectrogram) by applying an adaptive prior derived from data statistics computed from the conditional information. We formulate the training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior through a theoretical analysis. Focusing on the speech synthesis domain, we consider recently proposed diffusion-based speech generative models in both the spectral and time domains and show that PriorGrad achieves faster convergence and inference with superior performance, leading to improved perceptual quality and robustness to a smaller network capacity, thereby demonstrating the efficiency of a data-dependent adaptive prior.
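To make the idea of an adaptive prior concrete, the following is a minimal sketch, not the implementation described in the paper. It assumes a vocoder-style setting in which a per-frame energy is computed from the mel-spectrogram, normalized, floored, and used as the standard deviation of a zero-mean Gaussian prior over waveform samples. The function names, the energy definition, the `floor` value, and `hop_length` are all hypothetical choices made for illustration.

```python
import math
import random


def frame_energy_std(mel_frames, floor=0.1):
    """Return one standard deviation per mel frame (assumed recipe).

    Each frame's energy is taken as the L2 norm of its mel magnitudes,
    normalized by the largest frame energy and clipped from below at
    `floor` so that silent frames still receive some noise.
    """
    energies = [math.sqrt(sum(v * v for v in frame)) for frame in mel_frames]
    peak = max(energies) or 1.0  # avoid division by zero for all-silent input
    return [max(e / peak, floor) for e in energies]


def upsample(stds, hop_length):
    """Repeat each frame-level value `hop_length` times (waveform resolution)."""
    return [s for s in stds for _ in range(hop_length)]


def sample_adaptive_prior(stds, rng):
    """Draw zero-mean Gaussian noise with a per-sample standard deviation."""
    return [rng.gauss(0.0, s) for s in stds]


# Toy input: three frames of a two-bin "mel-spectrogram" (hypothetical values).
mel = [[0.0, 0.0], [3.0, 4.0], [0.6, 0.8]]
stds = upsample(frame_energy_std(mel), hop_length=2)
noise = sample_adaptive_prior(stds, random.Random(0))
```

Under these assumptions, low-energy frames start the reverse process from low-variance noise, while a standard Gaussian prior would use unit variance everywhere. That mismatch is the discrepancy between data and prior that the abstract refers to.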