The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media generation. One area that has yet to be fully explored is the application of diffusion models to audio generation. Audio generation requires an understanding of multiple aspects, such as the temporal dimension, long-term structure, multiple layers of overlapping sounds, and nuances that only trained listeners can detect. In this work, we investigate the potential of diffusion models for audio generation. We propose a set of models that address these aspects, including a new method for text-conditional latent audio diffusion with stacked 1D U-Nets that can generate multiple minutes of music from a textual description. For each model, we aim to maintain reasonable inference speed, targeting real-time generation on a single consumer GPU. In addition to trained models, we provide a collection of open-source libraries with the hope of simplifying future work in the field. Samples are available at https://bit.ly/audio-diffusion. Code is available at https://github.com/archinetai/audio-diffusion-pytorch.
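To make the core idea concrete, below is a minimal illustrative sketch (not the authors' implementation and not the API of the linked library) of a text-conditional 1D diffusion denoising step in PyTorch. The tiny convolutional network stands in for the stacked 1D U-Nets described above, and `text_emb` stands in for the output of a real text encoder; the cosine noise schedule and epsilon-prediction objective are standard choices assumed here for illustration.

```python
# Illustrative sketch only: a toy text-conditional 1D diffusion training step.
import torch
import torch.nn as nn

class TinyUNet1d(nn.Module):
    """Toy stand-in for a stacked 1D U-Net operating on waveforms or latents."""
    def __init__(self, channels=32, cond_dim=64):
        super().__init__()
        self.down = nn.Conv1d(1, channels, kernel_size=4, stride=2, padding=1)
        self.cond_proj = nn.Linear(cond_dim, channels)   # inject text conditioning
        self.time_proj = nn.Linear(1, channels)          # inject diffusion time / noise level
        self.up = nn.ConvTranspose1d(channels, 1, kernel_size=4, stride=2, padding=1)

    def forward(self, x_noisy, t, text_emb):
        h = torch.relu(self.down(x_noisy))
        h = h + self.cond_proj(text_emb).unsqueeze(-1) \
              + self.time_proj(t.unsqueeze(-1)).unsqueeze(-1)
        return self.up(h)  # predicted noise, same shape as the input

# One training step with the standard noise-prediction objective.
model = TinyUNet1d()
x0 = torch.randn(8, 1, 2048)     # clean audio (or latent) batch: [batch, channels, time]
text_emb = torch.randn(8, 64)    # hypothetical text-encoder output
t = torch.rand(8)                # diffusion times in [0, 1]
noise = torch.randn_like(x0)
alpha = torch.cos(t * torch.pi / 2).view(-1, 1, 1)   # simple cosine schedule
sigma = torch.sin(t * torch.pi / 2).view(-1, 1, 1)
x_noisy = alpha * x0 + sigma * noise
loss = ((model(x_noisy, t, text_emb) - noise) ** 2).mean()
loss.backward()
```

In a latent formulation, `x0` would be the output of an audio autoencoder rather than a raw waveform, which is what keeps inference fast enough to target real-time generation on a single consumer GPU.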