Image creation with generative adversarial networks has been widely adopted in the multi-modal regime with the advent of multi-modal representation models pre-trained on large corpora. Modalities that share a common representation space can be used to guide generative models to create images from text or even from an audio source. Departing from previous methods that rely solely on either text or audio, we exploit the expressiveness of both modalities. Based on the fusion of text and audio, we create a video whose content is consistent with the distinct modalities provided. Our method includes a simple approach that automatically segments the video into variable-length intervals and maintains temporal consistency in the generated video. Our proposed framework for generating music videos shows promising results at the application level, where users can interactively feed in a music source and a text source to create artistic music videos. Our code is available at https://github.com/joeljang/music2video.
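As a rough illustration of the fusion idea described above, the sketch below shows one way text and audio embeddings that live in a shared representation space could be combined into a single guidance target for a generated frame. This is a minimal sketch under assumptions, not the paper's exact pipeline: it assumes a CLIP text/image encoder and an audio embedding already projected into the same space (e.g., by an encoder such as Wav2CLIP); the fusion weight `alpha` and the stand-in `audio_embed` are hypothetical placeholders.

```python
# Minimal sketch (not the authors' exact pipeline): fuse a CLIP text embedding
# with an audio embedding assumed to lie in the same space, then score a
# candidate frame against the fused target with a cosine-distance guidance loss.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def fused_target(text: str, audio_embed: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Weighted fusion of text and audio embeddings (alpha is a hypothetical knob)."""
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        text_embed = model.encode_text(tokens).float()
    text_embed = F.normalize(text_embed, dim=-1)
    audio_embed = F.normalize(audio_embed.to(device).float(), dim=-1)
    return F.normalize(alpha * text_embed + (1 - alpha) * audio_embed, dim=-1)

def guidance_loss(frame: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cosine distance between a generated frame (CLIP-preprocessed) and the fused target."""
    image_embed = F.normalize(model.encode_image(frame).float(), dim=-1)
    return 1.0 - (image_embed * target).sum(dim=-1).mean()

# Stand-in audio embedding for illustration; a real system would obtain this
# from an audio encoder trained to match the CLIP space.
audio_embed = torch.randn(1, 512)
target = fused_target("a neon city at night", audio_embed)
```

In a guided-generation loop, `guidance_loss` would be backpropagated into the generator's latent for each video segment, which is how per-interval text and audio cues could steer the frames while the latent is carried across intervals for temporal consistency.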