Neural vocoders based on denoising diffusion probabilistic models (DDPMs) have been improved by adapting the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad, which adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves sound quality, especially in the high-frequency bands. The filtering is performed in the time-frequency domain, which keeps the computational cost almost the same as that of conventional DDPM-based neural vocoders. Experimental results show that SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios. Audio demos are available at wavegrad.github.io/specgrad/.