Recent development of neural vocoders based on the generative adversarial network (GAN) has shown their advantage of generating raw waveforms conditioned on mel-spectrograms with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech in various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch between the ground-truth spectrograms used in the training phase and the predicted spectrograms used in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the mel-spectrogram predicted by the Text-to-Speech (TTS) acoustic model, as the time-frequency domain supervision for the GAN-based vocoder. We also utilize sine excitation as time-domain supervision to improve harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experimental results show that DSPGAN significantly outperforms the compared approaches and can generate high-fidelity speech for diverse TTS data.
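To make the two supervision signals concrete, below is a minimal Python sketch, not the authors' implementation: the sample rate, hop size, and mel configuration are assumed values, and the DSP module is replaced by a placeholder. It shows a sine excitation derived from a frame-level F0 contour (the time-domain supervision) and a log-mel spectrogram extracted from a DSP-generated waveform rather than from the acoustic model's predicted mel (the time-frequency domain supervision).

```python
# Minimal sketch of the two supervision signals described in the abstract.
# Assumptions: numpy/librosa are available; SR, HOP, and the mel settings
# are illustrative; the DSP module output is stood in for by a placeholder.
import numpy as np
import librosa

SR = 24000   # sample rate (assumed)
HOP = 256    # samples per F0 frame (assumed)

def sine_excitation(f0_frames, sr=SR, hop=HOP):
    """Time-domain sine excitation from a frame-level F0 contour.
    Unvoiced frames (f0 == 0) yield silence."""
    f0 = np.repeat(f0_frames, hop)             # upsample F0 to sample rate
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)   # instantaneous phase
    return np.where(f0 > 0, 0.1 * np.sin(phase), 0.0).astype(np.float32)

def tf_supervision(dsp_waveform, sr=SR, hop=HOP):
    """Time-frequency supervision: log-mel extracted from the waveform
    generated by the DSP module (not the TTS acoustic model's prediction)."""
    mel = librosa.feature.melspectrogram(
        y=dsp_waveform, sr=sr, n_fft=1024, hop_length=hop, n_mels=80)
    return np.log(np.clip(mel, 1e-5, None))

# Usage: a 100-frame voiced contour around 220 Hz.
f0 = np.full(100, 220.0)
excitation = sine_excitation(f0)     # time-domain supervision signal
dsp_wave = excitation                # placeholder for the DSP module's output
mel_cond = tf_supervision(dsp_wave)  # time-frequency domain supervision
```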