Recent development of neural vocoders based on the generative adversarial network (GAN) has shown clear advantages in generating raw waveforms conditioned on mel-spectrograms, with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech in various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch between the ground-truth spectrograms used in the training phase and the predicted spectrograms used in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the mel-spectrogram predicted by the Text-to-Speech (TTS) acoustic model, as the time-frequency domain supervision for the GAN-based vocoder. We also utilize sine excitation as the time-domain supervision to improve harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experiments show that DSPGAN significantly outperforms the compared approaches and can generate high-fidelity speech for various TTS models trained on diverse data.
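The sine excitation mentioned above can be illustrated with a minimal sketch in the style of source-filter excitation signals: a sine whose instantaneous phase follows the F0 contour in voiced regions, plus a small Gaussian noise component. The function name, sample rate, and amplitude/noise parameters here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sine_excitation(f0, sr=16000, amp=0.1, noise_std=0.003):
    """Sketch of a sine-based excitation signal.

    f0: per-sample fundamental frequency in Hz (0 for unvoiced samples).
    Returns a waveform of the same length as f0.
    """
    # Instantaneous phase is the cumulative sum of per-sample frequency.
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    voiced = f0 > 0
    sine = amp * np.sin(phase)                    # harmonic (sine) component
    noise = noise_std * np.random.randn(len(f0))  # small noise component
    # Voiced samples: sine plus noise; unvoiced samples: noise only.
    return np.where(voiced, sine + noise, noise)

# Example: 0.1 s of a 100 Hz voiced segment followed by 0.1 s unvoiced.
f0 = np.concatenate([np.full(1600, 100.0), np.zeros(1600)])
exc = sine_excitation(f0)
```

In a vocoder such as DSPGAN, an excitation like this serves as a time-domain reference that encourages the generator to reproduce clean harmonic structure in voiced regions.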