Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain. However, generating a high-quality singing voice remains challenging because singing exhibits a wider variety of musical expression in pitch, loudness, and pronunciation. In this work, we propose a hierarchical diffusion model for singing voice neural vocoders. The proposed method consists of multiple diffusion models operating at different sampling rates: the model at the lowest sampling rate focuses on generating accurate low-frequency components such as pitch, and the other models progressively generate the waveform at higher sampling rates, conditioned on the data at the lower sampling rate and on acoustic features. Experimental results show that the proposed method produces high-quality singing voices for multiple singers, outperforming state-of-the-art neural vocoders with comparable computational cost.
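The coarse-to-fine generation order described above can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_stage` is a hypothetical placeholder for a trained diffusion model's full reverse process, the sampling rates (6, 12, and 24 kHz) are assumed for illustration, and the sample-repetition upsampler stands in for a proper resampler. Only the control flow is meant to reflect the abstract: each stage is conditioned on the output of the stage below it and on the acoustic features.

```python
import numpy as np

def upsample_repeat(x, factor):
    """Naive upsampling by sample repetition (placeholder for a real resampler)."""
    return np.repeat(x, factor)

def toy_stage(noise, lower, features):
    """Hypothetical stand-in for one trained diffusion model.

    A real stage would iteratively denoise `noise`, conditioned on `lower`
    (the lower-rate waveform) and `features` (e.g. acoustic features).
    Here we only mix the conditioning signal with a little noise.
    """
    base = lower if lower is not None else np.zeros_like(noise)
    return base + 0.01 * noise

def hierarchical_generate(stages, rates, features, seconds, seed=0):
    """Run the stages from the lowest to the highest sampling rate."""
    rng = np.random.default_rng(seed)
    lower, lower_rate = None, None
    for stage, rate in zip(stages, rates):
        n = int(rate * seconds)
        noise = rng.standard_normal(n)
        cond = None
        if lower is not None:
            # Bring the previous output up to the current rate for conditioning.
            cond = upsample_repeat(lower, rate // lower_rate)[:n]
        lower, lower_rate = stage(noise, cond, features), rate
    return lower

# Assumed rates; each is an integer multiple of the one before it.
rates = [6000, 12000, 24000]
wave = hierarchical_generate([toy_stage] * 3, rates, features=None, seconds=0.5)
print(wave.shape)  # (12000,)
```

In this sketch the stages run sequentially because each one needs the output of the stage below it; the abstract does not say how the models are trained, so nothing about training is shown here.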