We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
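The phrase "hierarchical sequence-to-sequence modeling" can be made concrete with a toy sketch. The Python below is not the authors' released code: the stage names (`semantic_stage`, `acoustic_stage`, `decode`), token rates, and vocabulary sizes are illustrative assumptions, and the stub models emit random tokens and noise purely to show the shape of the pipeline, in which a coarse token stream models long-term structure and a finer token stream, conditioned on it, is decoded into a 24 kHz waveform.

```python
"""Minimal runnable sketch of a hierarchical sequence-to-sequence music
generation pipeline, as described in the abstract. All components are
random stand-ins, not MusicLM's trained models."""

import numpy as np

rng = np.random.default_rng(0)
SAMPLE_RATE = 24_000          # MusicLM generates audio at 24 kHz
SEMANTIC_RATE = 25            # coarse tokens per second (assumed value)
ACOUSTIC_PER_SEMANTIC = 8     # fine tokens per coarse token (assumed value)

def semantic_stage(text_embedding, n_tokens, vocab=1024):
    """Stand-in for the first stage: map text conditioning to a coarse
    token sequence capturing long-term structure (melody, rhythm).
    A real model would sample autoregressively from a Transformer
    conditioned on text_embedding; here we draw random tokens."""
    return rng.integers(0, vocab, size=n_tokens)

def acoustic_stage(text_embedding, semantic_tokens, vocab=1024):
    """Stand-in for the second stage: expand each coarse token into
    several fine tokens carrying acoustic detail, again conditioned on
    the text in a real system."""
    return rng.integers(0, vocab, size=len(semantic_tokens) * ACOUSTIC_PER_SEMANTIC)

def decode(acoustic_tokens, seconds):
    """Stand-in for a neural-codec decoder; emits noise of the correct
    length rather than decoding the tokens."""
    return rng.standard_normal(int(seconds * SAMPLE_RATE)).astype(np.float32)

seconds = 10
text_embedding = rng.standard_normal(128)   # placeholder text conditioning
coarse = semantic_stage(text_embedding, seconds * SEMANTIC_RATE)
fine = acoustic_stage(text_embedding, coarse)
audio = decode(fine, seconds)
print(audio.shape)  # (240000,) -- 10 s of audio at 24 kHz
```

The design point the sketch illustrates is the hierarchy itself: the coarse stage operates at a low token rate, which is what lets the model stay consistent over several minutes, while the fine stage and decoder restore audio fidelity.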