Denoising diffusion probabilistic models have recently been proposed to generate high-quality samples by estimating the gradient of the data density. The framework assumes a standard Gaussian prior, whereas the data distribution may be considerably more complex. This discrepancy between the data and the prior can make denoising the prior noise into a data sample inefficient. In this paper, we propose PriorGrad, which improves the efficiency of conditional diffusion models (for example, a vocoder conditioned on a mel-spectrogram) by applying an adaptive prior derived from data statistics computed from the conditional information. We formulate the training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior through a theoretical analysis. Focusing on the audio domain, we consider recently proposed diffusion-based audio generative models in both the spectral and time domains and show that PriorGrad converges faster, which yields data and parameter efficiency and improved quality, thereby demonstrating the efficiency of a data-driven adaptive prior.
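To make the idea of an adaptive prior concrete, the following is a minimal sketch rather than the reference implementation. It assumes a zero-mean Gaussian prior whose per-sample standard deviation is taken from the normalized frame-level energy of a magnitude mel-spectrogram, and a loss in which the noise residual is scaled by the inverse of that standard deviation. The function names (`adaptive_prior_std`, `diffuse`, `weighted_loss`), the `floor` value, the hop length, and the random placeholder data are all illustrative assumptions.

```python
import numpy as np

def adaptive_prior_std(mel, hop_length, floor=0.1):
    """Per-sample prior std from a magnitude mel-spectrogram of shape (n_mels, n_frames).

    Assumption: frame energy is normalized to (0, 1] and clipped from below.
    """
    frame_energy = np.sqrt((mel ** 2).sum(axis=0))          # one value per frame
    scaled = frame_energy / max(frame_energy.max(), 1e-8)   # normalize by the peak
    std_frames = np.clip(scaled, floor, 1.0)                # avoid a degenerate prior
    return np.repeat(std_frames, hop_length)                # upsample to waveform length

def diffuse(x0, std, alpha_bar, rng):
    """Forward diffusion step using noise drawn from N(0, diag(std**2))."""
    eps = std * rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

def weighted_loss(eps, eps_hat, std):
    """Squared error on the noise, scaled by the inverse prior variance."""
    return float(np.mean(((eps - eps_hat) / std) ** 2))

# Usage with random placeholder data (not real audio).
rng = np.random.default_rng(0)
mel = np.abs(rng.standard_normal((80, 10)))   # hypothetical 80-bin, 10-frame input
std = adaptive_prior_std(mel, hop_length=256)
x0 = rng.standard_normal(std.shape)           # stand-in for a clean waveform
x_t, eps = diffuse(x0, std, alpha_bar=0.5, rng=rng)
```

In this sketch, low-energy frames receive low-variance noise, so the prior already resembles the signal envelope before any denoising step is taken. A trained network would supply `eps_hat`; none is included here.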