The recent success of StyleGAN demonstrates that a pre-trained StyleGAN latent space is useful for realistic video generation. However, the generated motion in the video is usually not semantically meaningful, because it is difficult to determine the direction and magnitude of a motion in the StyleGAN latent space. In this paper, we propose a framework that generates realistic videos by leveraging a multimodal (sound-image-text) embedding space. Because sound provides the temporal context of a scene, our framework learns to generate a video that is semantically consistent with the sound. First, our sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate a CLIP-based multimodal embedding space to further capture audio-visual relationships. Finally, the proposed frame generator learns to find a trajectory in the latent space that is coherent with the corresponding sound and generates the video in a hierarchical manner. We also provide a new high-resolution landscape video dataset of audio-visual pairs for the sound-guided video generation task. Experiments show that our model outperforms state-of-the-art methods in terms of video quality. We further demonstrate several applications, including image and video editing, to verify the effectiveness of our method.
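To make the pipeline concrete, the sketch below illustrates the three stages named in the abstract: a sound inversion module that maps audio into a StyleGAN-style latent code, a frame generator that unrolls a latent trajectory conditioned on that code, and a contrastive loss that aligns audio and frame embeddings in a shared CLIP-like space. This is a minimal illustrative assumption, not the authors' implementation; all module names, dimensions, and the InfoNCE-style loss are stand-ins, and the pre-trained StyleGAN generator and CLIP encoders are replaced by random tensors.

```python
# Minimal PyTorch sketch of the sound-guided video generation pipeline.
# All names (SoundInversion, FrameGenerator) and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 512   # assumed StyleGAN w-space dimensionality
NUM_FRAMES = 16    # assumed number of generated video frames

class SoundInversion(nn.Module):
    """Maps a log-mel spectrogram to a code in the StyleGAN latent space."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, LATENT_DIM)

    def forward(self, mel):                  # mel: (B, T, n_mels)
        _, h = self.rnn(mel)                 # h: (1, B, 256)
        return self.proj(h.squeeze(0))       # (B, LATENT_DIM)

class FrameGenerator(nn.Module):
    """Unrolls a latent trajectory conditioned on the sound code."""
    def __init__(self):
        super().__init__()
        self.step = nn.GRUCell(LATENT_DIM, LATENT_DIM)

    def forward(self, w_sound, w_init, num_frames=NUM_FRAMES):
        w, trajectory = w_init, []
        for _ in range(num_frames):
            w = self.step(w_sound, w)        # one step along the trajectory
            trajectory.append(w)
        return torch.stack(trajectory, 1)    # (B, num_frames, LATENT_DIM)

def clip_alignment_loss(audio_emb, frame_emb, temperature=0.07):
    """InfoNCE-style loss pulling paired audio/frame embeddings together."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    frame_emb = F.normalize(frame_emb, dim=-1)
    logits = audio_emb @ frame_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy forward pass with random tensors in place of real audio and StyleGAN.
mel = torch.randn(4, 100, 80)                # batch of spectrograms
w_sound = SoundInversion()(mel)              # audio -> latent code
w_init = torch.randn(4, LATENT_DIM)          # latent of the first frame
traj = FrameGenerator()(w_sound, w_init)     # latent trajectory
print(traj.shape)                            # torch.Size([4, 16, 512])
```

In the full method, each latent code along the trajectory would be decoded to an image by the frozen pre-trained StyleGAN generator, and the CLIP-based loss would supervise the trajectory so that the decoded frames stay semantically consistent with the driving sound.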