High-fidelity singing voice synthesis is challenging for neural vocoders due to extremely long continuous pronunciation, high sampling rates, and strong expressiveness. Existing neural vocoders designed for text-to-speech cannot be directly applied to singing voice synthesis because they produce glitches in the generated spectrogram and reconstruct high frequencies poorly. To tackle the difficulty of singing modeling, in this paper we propose SingGAN, a singing voice vocoder based on a generative adversarial network. Specifically, 1) SingGAN uses source excitation to alleviate the glitch problem in the spectrogram; and 2) SingGAN adopts multi-band discriminators and introduces a frequency-domain loss and a sub-band feature matching loss to supervise high-frequency reconstruction. To our knowledge, SingGAN is the first vocoder designed for high-fidelity multi-speaker singing voice synthesis. Experimental results show that SingGAN synthesizes singing voices with much higher quality (a 0.41 MOS gain) than the previous method. Further experiments show that, combined with FastSpeech~2 as the acoustic model, SingGAN achieves high robustness in the singing voice synthesis pipeline and also performs well in speech synthesis.