Denoising Diffusion Probabilistic Models (DDPMs) are gaining traction in text-to-speech (TTS) synthesis because they can generate high-fidelity samples. However, their iterative refinement process in a high-dimensional data space makes inference slow, which restricts their use in real-time systems. Previous works have sped up inference by reducing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while maintaining high sample quality, we propose ResGrad, a lightweight diffusion model that learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other DDPM acceleration methods, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is applied in the inference process of the existing TTS model in a plug-and-play way, without re-training that model. We verify ResGrad on the single-speaker dataset LJSpeech and on two more challenging datasets, one with multiple speakers (LibriTTS) and one with a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed, measured by real-time factor; 2) at similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.
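The residual formulation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `residual_target` and `refine`, the array shapes, the noise level, and the `oracle` predictor standing in for the diffusion model are all assumptions made for illustration only.

```python
import numpy as np


def residual_target(gt_mel, tts_mel):
    """Training target: ground-truth mel-spectrogram minus the TTS model output."""
    return gt_mel - tts_mel


def refine(tts_mel, predict_residual):
    """Inference: add the predicted residual to the output of the frozen TTS model."""
    return tts_mel + predict_residual(tts_mel)


rng = np.random.default_rng(0)

# Hypothetical data: an 80-bin, 120-frame mel-spectrogram and an imperfect
# TTS output that deviates slightly from it (toy values, not real speech).
gt_mel = rng.normal(size=(80, 120))
tts_mel = gt_mel + 0.1 * rng.normal(size=gt_mel.shape)

residual = residual_target(gt_mel, tts_mel)


def oracle(mel):
    """Stand-in for the residual model: returns the exact residual (assumption)."""
    return residual_target(gt_mel, mel)


refined = refine(tts_mel, oracle)

# In this toy setup the residual has a much smaller magnitude than the
# spectrogram itself, and a perfect residual predictor recovers the target.
residual_is_smaller = np.abs(residual).mean() < np.abs(gt_mel).mean()
recovers_target = np.allclose(refined, gt_mel)
```

In this sketch the TTS output is never modified or re-trained; only the additive correction is learned, which mirrors the plug-and-play property claimed in the abstract. A real system would replace `oracle` with a trained diffusion model that samples the residual conditioned on the TTS output.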