Fast and user-controllable music generation could enable novel ways of composing or performing music. However, state-of-the-art music generation systems require large amounts of data and computational resources for training, and are slow at inference. This makes them impractical for real-time interactive use. In this work, we introduce Musika, a music generation system that can be trained on hundreds of hours of music using a single consumer GPU, and that allows for much faster than real-time generation of music of arbitrary length on a consumer CPU. We achieve this by first learning a compact invertible representation of spectrogram magnitudes and phases with adversarial autoencoders, then training a Generative Adversarial Network (GAN) on this representation for a particular music domain. A latent coordinate system enables generating arbitrarily long sequences of excerpts in parallel, while a global context vector allows the music to remain stylistically coherent through time. We perform quantitative evaluations to assess the quality of the generated samples and showcase options for user control in piano and techno music generation. We release the source code and pretrained autoencoder weights at github.com/marcoppasini/musika, such that a GAN can be trained on a new music domain with a single GPU in a matter of hours.
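To make the generation scheme in the abstract concrete, the following is a minimal sketch (not the authors' code) of the second stage it describes: a GAN generator produces latent excerpts in parallel, each conditioned on a latent coordinate and a shared global context vector, and a pretrained autoencoder decoder inverts the concatenated latents back to audio. All module names, layer sizes, and the independent sampling of coordinates are illustrative assumptions; in the actual system, coordinates are derived so that neighboring excerpts connect smoothly.

```python
# Illustrative sketch of parallel latent-excerpt generation with a global
# context vector, assuming a pretrained autoencoder decoder is available.
# Dimensions below are assumptions, not the paper's actual hyperparameters.
import torch
import torch.nn as nn

LATENT_DIM = 64    # assumed channel width of the autoencoder latent
STYLE_DIM = 128    # assumed size of the global context vector
COORD_DIM = 16     # assumed size of one latent coordinate
EXCERPT_LEN = 32   # assumed latent time steps per generated excerpt

class ExcerptGenerator(nn.Module):
    """Maps (latent coordinate, global context) -> one latent excerpt."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(COORD_DIM + STYLE_DIM, 512), nn.ReLU(),
            nn.Linear(512, EXCERPT_LEN * LATENT_DIM),
        )

    def forward(self, coord, style):
        h = self.net(torch.cat([coord, style], dim=-1))
        return h.view(-1, EXCERPT_LEN, LATENT_DIM)

def generate_track(gen, decoder, n_excerpts, device="cpu"):
    # A single global context vector, shared across all excerpts, keeps the
    # output stylistically coherent through time. Coordinates are sampled
    # independently here for brevity; the actual system interpolates them so
    # adjacent excerpts join smoothly. All excerpts go through the generator
    # as one batch, which is what enables parallel generation.
    style = torch.randn(1, STYLE_DIM, device=device).expand(n_excerpts, -1)
    coords = torch.randn(n_excerpts, COORD_DIM, device=device)
    latents = gen(coords, style)                  # all excerpts in parallel
    latents = latents.reshape(1, -1, LATENT_DIM)  # concatenate along time
    return decoder(latents)                       # invert latents to audio

# Usage with a stand-in decoder (the real one is the pretrained autoencoder's):
decoder = nn.Linear(LATENT_DIM, 2048)
audio = generate_track(ExcerptGenerator(), decoder, n_excerpts=8)
```

Because the generator is only a batched feed-forward map over latent coordinates, sequence length is bounded only by memory, which is what allows arbitrary-length, faster-than-real-time generation on a CPU.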