DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

Recent neural vocoders based on generative adversarial networks (GANs) have shown advantages in generating raw waveforms conditioned on mel-spectrograms, with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech in various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch between the ground-truth spectrograms used in the training phase and the predicted spectrograms used in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the predicted mel-spectrogram from the Text-to-Speech (TTS) acoustic model, as the time-frequency domain supervision for the GAN-based vocoder. We also utilize sine excitation as the time-domain supervision to improve harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experimental results show that DSPGAN significantly outperforms the compared approaches and can generate high-fidelity speech based on diverse data in TTS.
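The abstract does not spell out how the sine excitation is built, so the following is only a minimal sketch of one common construction: a frame-level F0 contour is upsampled to the sample rate, integrated into a phase, and turned into a sine, with low-level noise in unvoiced regions. The function name `sine_excitation` and every parameter value (sample rate, hop length, amplitude, noise level) are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def sine_excitation(f0_frames, sample_rate=24000, hop_length=240,
                    amplitude=0.1, noise_std=0.003, seed=0):
    """Build a sample-level sine excitation from a frame-level F0 contour.

    f0_frames holds one F0 value in Hz per frame; 0 marks an unvoiced frame.
    All default values are illustrative assumptions.
    """
    f0_frames = np.asarray(f0_frames, dtype=np.float64)
    # Upsample F0 from frame rate to sample rate by repeating each value.
    f0 = np.repeat(f0_frames, hop_length)
    voiced = f0 > 0
    # The instantaneous phase is the running sum of per-sample increments.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_std, size=f0.shape)
    # Voiced samples: sine at F0 plus small noise. Unvoiced samples: noise only.
    return np.where(voiced, amplitude * np.sin(phase) + noise, noise)


# Two unvoiced frames, two voiced frames at 220 Hz, one unvoiced frame.
excitation = sine_excitation([0, 0, 220, 220, 0])
```

A signal like this gives the vocoder an explicit periodic reference at the pitch frequency, which is the kind of time-domain supervision the abstract credits with better harmonic modeling.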

Related content

Speech synthesis

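Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on many disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computer technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated-response call centers, and has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, language teaching, rehabilitation therapy, and other areas. Speech synthesis can be said to be affecting every aspect of people's lives.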
