We present the Melody Guided Music Generation (MG2) model, a novel approach that uses melody to guide text-to-music generation and, despite a remarkably simple design and extremely limited resources, achieves excellent performance. Specifically, we first align text with audio waveforms and their associated melodies using the newly proposed Contrastive Language-Music Pretraining, so that the learned text representation is fused with implicit melody information. We then condition a retrieval-augmented diffusion module on both the text prompt and the retrieved melody. This allows MG2 to generate music that reflects the content of the given text description while maintaining intrinsic harmony under the guidance of explicit melody information. We conducted extensive experiments on two public datasets, MusicCaps and MusicBench. The results demonstrate that the proposed MG2 model surpasses current open-source text-to-music generation models while using fewer than 1/3 of the parameters and less than 1/200 of the training data of state-of-the-art counterparts. Furthermore, we carried out comprehensive human evaluations to explore the potential applications of MG2 in real-world scenarios.
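The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of how retrieval-augmented melody conditioning might look: the text embedding produced by the contrastive encoder retrieves its nearest melody embedding from a precomputed bank, and both are fed as conditioning to a toy diffusion denoiser. All function names, shapes, and the concatenation-based conditioning are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of melody-retrieval-augmented conditioning (not the MG2 implementation).
import torch
import torch.nn.functional as F


def retrieve_melody(text_emb: torch.Tensor, melody_bank: torch.Tensor) -> torch.Tensor:
    """Return, for each prompt, the melody embedding most similar to its text embedding.

    text_emb:    (batch, dim)      -- text features from the contrastive encoder
    melody_bank: (num_clips, dim)  -- precomputed melody features of training clips
    """
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(melody_bank, dim=-1).T  # (batch, num_clips)
    idx = sims.argmax(dim=-1)                                                  # nearest neighbour per prompt
    return melody_bank[idx]                                                    # (batch, dim)


class ConditionedDenoiser(torch.nn.Module):
    """Toy denoiser conditioned on concatenated text and retrieved-melody embeddings."""

    def __init__(self, latent_dim: int = 64, cond_dim: int = 128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim + 2 * cond_dim + 1, 256),
            torch.nn.SiLU(),
            torch.nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_latent, t, text_emb, melody_emb):
        # Concatenate the noisy latent, both conditioning vectors, and the timestep.
        cond = torch.cat([noisy_latent, text_emb, melody_emb, t[:, None]], dim=-1)
        return self.net(cond)  # predicted noise


if __name__ == "__main__":
    batch, dim, latent_dim = 2, 128, 64
    text_emb = torch.randn(batch, dim)     # stand-in for contrastively trained text features
    melody_bank = torch.randn(1000, dim)   # stand-in for the melody retrieval bank
    melody_emb = retrieve_melody(text_emb, melody_bank)

    denoiser = ConditionedDenoiser(latent_dim, dim)
    noisy_latent = torch.randn(batch, latent_dim)
    t = torch.rand(batch)                  # diffusion timestep in [0, 1)
    eps_hat = denoiser(noisy_latent, t, text_emb, melody_emb)
    print(eps_hat.shape)                   # torch.Size([2, 64])
```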