In this paper, we present Msanii, a novel diffusion-based model for efficiently synthesizing long-context, high-fidelity music. Our model combines the expressiveness of mel spectrograms, the generative capabilities of diffusion models, and the vocoding capabilities of neural vocoders. We demonstrate the effectiveness of Msanii by synthesizing long (190-second) stereo music samples at a high sample rate (44.1 kHz) without the use of concatenative synthesis, cascading architectures, or compression techniques. To the best of our knowledge, this is the first work to successfully employ a diffusion-based model to synthesize such long music samples at high sample rates. Our demo can be found at https://kinyugo.github.io/msanii-demo and our code at https://github.com/Kinyugo/msanii.
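To make the described pipeline concrete, the sketch below illustrates the overall flow implied by the abstract: sample a mel spectrogram with a diffusion model, then invert it to a waveform with a neural vocoder. The `diffusion_model.sample` and `vocoder` interfaces are hypothetical placeholders, and the spectrogram parameters are illustrative assumptions; the actual Msanii architecture and settings are defined in the linked repository.

```python
# Minimal sketch of a mel-spectrogram diffusion + vocoder pipeline (assumptions noted above).
import torch
import torchaudio

SAMPLE_RATE = 44_100  # high sample rate quoted in the abstract
N_MELS = 128          # illustrative mel resolution, not necessarily Msanii's

# Mel spectrogram front end used to train the diffusion model on audio.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=2048,
    hop_length=512,
    n_mels=N_MELS,
)

def synthesize(diffusion_model, vocoder, num_frames: int) -> torch.Tensor:
    """Sample a mel spectrogram with a diffusion model, then vocode it to stereo audio."""
    # Start from Gaussian noise in mel-spectrogram space (2 channels for stereo).
    noise = torch.randn(1, 2, N_MELS, num_frames)
    # Iteratively denoise the spectrogram (hypothetical `sample` interface).
    mel = diffusion_model.sample(noise)
    # Invert the generated mel spectrogram back to a waveform with a neural vocoder.
    waveform = vocoder(mel)
    return waveform
```

Working in mel-spectrogram space rather than on raw waveforms is what lets a single (non-cascaded, non-compressed) model cover minutes of audio: the diffusion model operates on a far shorter time axis, and the vocoder restores the full 44.1 kHz resolution.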