Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain. However, generating high-quality singing voice remains challenging because singing exhibits a wider variety of musical expression in pitch, loudness, and pronunciation. In this work, we propose a hierarchical diffusion model for singing voice neural vocoders. The proposed method consists of multiple diffusion models operating at different sampling rates: the model at the lowest sampling rate focuses on generating accurate low-frequency components such as pitch, and the other models progressively generate the waveform at higher sampling rates, each conditioned on the data at the next lower sampling rate and on acoustic features. Experimental results show that the proposed method produces high-quality singing voice for multiple singers, outperforming state-of-the-art neural vocoders at a similar computational cost.
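The coarse-to-fine structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`downsample`, `upsample`, `build_pyramid`, `toy_denoiser`, `sample_hierarchy`), the block-averaging resampler, the linear blending schedule, and the decimation factors are all assumptions made for illustration, and conditioning on acoustic features is omitted. A trained network would replace `toy_denoiser`.

```python
import numpy as np

def downsample(x, factor):
    """Naive decimation by block averaging (stand-in for an anti-aliased resampler)."""
    n = len(x) // factor * factor
    return x[:n].reshape(-1, factor).mean(axis=1)

def upsample(x, factor):
    """Naive upsampling by sample repetition."""
    return np.repeat(x, factor)

def build_pyramid(wave, factors):
    """Versions of `wave` at each decimation factor, ordered lowest rate first."""
    return [wave if f == 1 else downsample(wave, f) for f in factors]

def toy_denoiser(noisy, cond_lower, t):
    """Hypothetical placeholder for a trained model: returns the lower-rate
    conditioning signal as its clean-signal estimate, or zeros if there is none."""
    return np.zeros_like(noisy) if cond_lower is None else cond_lower

def sample_hierarchy(length, factors, steps, rng):
    """Coarse-to-fine sampling: every stage starts from Gaussian noise and is
    conditioned on the output of the stage at the next lower sampling rate."""
    # Assumption: `length` is divisible by the largest decimation factor.
    assert length % max(factors) == 0
    outputs, lower, prev_factor = [], None, None
    for f in factors:  # lowest sampling rate first
        n = length // f
        cond = None if lower is None else upsample(lower, prev_factor // f)[:n]
        x = rng.standard_normal(n)
        for t in range(steps, 0, -1):
            x0_hat = toy_denoiser(x, cond, t)
            keep = (t - 1) / steps  # fraction of the current noisy sample retained
            x = keep * x + (1.0 - keep) * x0_hat
        outputs.append(x)
        lower, prev_factor = x, f
    return outputs

# Training targets at each rate, and a sampling pass over three stages.
targets = build_pyramid(np.arange(8.0), (4, 2, 1))
stages = sample_hierarchy(16, (4, 2, 1), steps=5, rng=np.random.default_rng(0))
```

In this sketch `build_pyramid` produces the per-rate training targets, while `sample_hierarchy` shows only the control flow at inference time: the lowest-rate stage runs unconditioned on any waveform, and each later stage receives the upsampled output of the stage before it.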