Over 60,000 songs are released on Spotify every day, and the competition for listeners' attention is immense. In that regard, the importance of captivating and inviting cover art cannot be overstated, because it is deeply entangled with a song's character and the artist's identity, and remains one of the most important gateways through which people discover music. However, designing cover art is a highly creative, lengthy, and sometimes expensive process that can be daunting, especially for non-professional artists. For this reason, we propose a novel deep-learning framework to generate cover art guided by audio features. Inspired by VQGAN-CLIP, our approach is highly flexible because individual components can easily be replaced without any retraining. This paper outlines the architectural details of our models and discusses the optimization challenges that emerge from them. More specifically, we exploit genetic algorithms to overcome bad local minima and adversarial examples. We find that our framework can generate suitable cover art for most genres, and that the visual features adapt to changes in the audio features. Given these results, we believe that our framework paves the way for extensions and more advanced applications in audio-guided visual generation tasks.