This paper introduces a robust singing voice synthesis (SVS) system that efficiently produces natural and realistic singing voices by leveraging an adversarial training strategy. On the one hand, we design simple but generic random-area conditional discriminators to help supervise the acoustic model, which effectively avoids over-smoothed spectrogram predictions and improves the expressiveness of SVS. On the other hand, we combine the spectrogram with a frame-level, linearly interpolated F0 sequence as the input to the neural vocoder, which is then optimized with the help of multiple adversarial conditional discriminators in the waveform domain and multi-scale distance functions in the frequency domain. Experimental results and ablation studies show that, compared with our previous auto-regressive work, the new system can efficiently produce high-quality singing voices after fine-tuning on different singing datasets ranging from several minutes to a few hours in duration. A large number of synthesized songs with different timbres are available online at https://zzw922cn.github.io/wesinger2, and we strongly encourage readers to listen to them.
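The vocoder conditioning described above can be illustrated with a minimal NumPy sketch: unvoiced frames (F0 = 0) are filled by linear interpolation between neighboring voiced frames, and the resulting frame-level F0 track is appended to the spectrogram as an extra feature channel. This is an assumption-laden sketch, not the paper's implementation; the function names, the 0-valued unvoiced convention, and the mel dimensionality are illustrative.

```python
import numpy as np

def linearly_interpolate_f0(f0):
    """Fill unvoiced frames (f0 == 0) by linear interpolation
    between the surrounding voiced frames (assumed convention)."""
    f0 = f0.astype(np.float64).copy()
    voiced = f0 > 0
    if not voiced.any():
        return f0
    idx = np.arange(len(f0))
    # np.interp clamps to the nearest voiced value at the edges.
    f0[~voiced] = np.interp(idx[~voiced], idx[voiced], f0[voiced])
    return f0

def build_vocoder_input(mel, f0):
    """Concatenate the spectrogram (frames x bins) with the
    interpolated F0 sequence as one extra feature column."""
    f0_interp = linearly_interpolate_f0(f0)
    return np.concatenate([mel, f0_interp[:, None]], axis=1)

# Toy usage: 5 frames of an 80-bin mel spectrogram plus F0.
mel = np.random.rand(5, 80)
f0 = np.array([0.0, 220.0, 0.0, 440.0, 0.0])
cond = build_vocoder_input(mel, f0)  # shape (5, 81)
```

In practice such a combined feature matrix would be fed frame-by-frame to the neural vocoder as its conditioning input.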