In this paper, we develop a new multi-singer Chinese neural singing voice synthesis (SVS) system named WeSinger. To improve the accuracy and naturalness of the synthesized singing voice, we design several dedicated modules and techniques: 1) a deep bidirectional LSTM-based duration model with a multi-scale rhythm loss and a post-processing step; 2) a Transformer-like acoustic model with a progressive pitch-weighted decoder loss; 3) a 24 kHz pitch-aware LPCNet neural vocoder to produce high-quality singing waveforms; and 4) a novel data augmentation method with multi-singer pre-training for stronger robustness and naturalness. To our knowledge, WeSinger is the first SVS system to adopt a 24 kHz LPCNet and multi-singer pre-training simultaneously. Both quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness, and WeSinger achieves state-of-the-art performance on the recent public Chinese singing corpus Opencpop\footnote{https://wenet.org.cn/opencpop/}. Some synthesized singing samples are available online\footnote{https://zzw922cn.github.io/wesinger/}.
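The abstract mentions a pitch-weighted decoder loss without giving its exact form. As a rough illustration of the general idea (not the paper's actual formulation), the sketch below assumes the weighting multiplies each frame's mel-spectrogram reconstruction error by a factor derived from that frame's normalized F0, so pitch-salient voiced frames contribute more to training; the function name and normalization scheme are hypothetical.

```python
import numpy as np

def pitch_weighted_l1(pred_mel, target_mel, f0, eps=1e-8):
    """Per-frame L1 mel loss weighted by normalized pitch (illustrative sketch).

    pred_mel, target_mel: (T, n_mels) arrays of mel-spectrogram frames.
    f0: (T,) pitch track in Hz, with 0 for unvoiced frames.
    Frames with higher F0 receive weights up to 2x, emphasizing
    pitch-salient regions of the reconstruction error.
    """
    frame_l1 = np.abs(pred_mel - target_mel).mean(axis=1)  # (T,) per-frame error
    weights = 1.0 + f0 / (f0.max() + eps)                  # in [1, 2]
    return float((weights * frame_l1).mean())
```

In practice such a weight could also be annealed over training ("progressive"), e.g. by interpolating from uniform weights toward the pitch-derived ones as training proceeds.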