Deep generative models have achieved significant progress in speech synthesis, but high-fidelity singing voice synthesis remains an open problem because of its long continuous pronunciation, rich high-frequency content, and strong expressiveness. Existing neural vocoders designed for text-to-speech cannot be applied directly to singing voice synthesis because they produce glitches and reconstruct high frequencies poorly. In this work, we propose SingGAN, a generative adversarial network designed for high-fidelity singing voice synthesis. Specifically, 1) to alleviate the glitch problem in the generated samples, we propose source excitation with adaptive feature learning filters, which expand the receptive field patterns and stabilize long continuous signal generation; 2) SingGAN introduces global and local discriminators at different scales to enrich low-frequency details and promote high-frequency reconstruction; and 3) to improve training efficiency, SingGAN includes auxiliary spectrogram losses and a sub-band feature matching penalty loss. To the best of our knowledge, SingGAN is the first work aimed at high-fidelity singing voice vocoding. Our evaluation shows that SingGAN achieves state-of-the-art results with higher-quality samples (MOS 4.05). SingGAN also samples 50x faster than real time on a single NVIDIA 2080Ti GPU. We further show that SingGAN generalizes well to mel-spectrogram inversion for unseen singers, and that the end-to-end singing voice synthesis system SingGAN-SVS uses a two-stage pipeline to transform music scores into expressive singing voices. Audio samples are available at \url{https://SingGAN.github.io/}.