ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability to generate high-fidelity samples. However, their iterative refinement process in a high-dimensional data space results in slow inference, which restricts their application in real-time systems. Previous works have sped up inference by reducing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model that learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and on two more challenging datasets, one with multiple speakers (LibriTTS) and one with a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed, measured by real-time factor; 2) at similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.
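The residual-refinement idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear noise schedule and its values, the four-step default, and the `eps_model` callable (standing in for a trained noise-prediction network conditioned on the coarse mel-spectrogram) are all assumptions made for illustration. The reverse loop is the standard DDPM ancestral sampling step, and the real-time factor is taken as synthesis time divided by audio duration.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    # Linear noise schedule; the endpoint values are illustrative assumptions.
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def residual_target(mel_gt, mel_coarse):
    # Training target: what the existing TTS model got wrong.
    return mel_gt - mel_coarse

def sample_residual(eps_model, mel_coarse, num_steps=4, seed=0):
    # Standard DDPM ancestral sampling over the residual, conditioned on
    # the coarse mel-spectrogram. eps_model(x, t, cond) is a hypothetical
    # noise predictor returning an array shaped like x.
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(mel_coarse.shape)
    for t in reversed(range(num_steps)):
        eps = eps_model(x, t, mel_coarse)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def refine(eps_model, mel_coarse, num_steps=4, seed=0):
    # Plug-and-play refinement: the existing TTS model is left untouched,
    # and the sampled residual is added to its output.
    return mel_coarse + sample_residual(eps_model, mel_coarse, num_steps, seed)

def real_time_factor(synthesis_seconds, audio_seconds):
    # RTF below 1 means synthesis runs faster than real time.
    return synthesis_seconds / audio_seconds
```

With a trained network in place of `eps_model`, `refine` would be called on the mel-spectrogram produced by the existing TTS model, and the result passed to a vocoder as usual.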

