As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods focus on synthesizing visual frames and disregard the audio signals present in realistic videos. In this work, we concentrate on the rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos together with audio signals. Specifically, we present SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and to improve the quantized representations. A cross-modal attention module extracts associated features of visual frames and audio signals for contrastive learning. A Transformer-based decoder then models the associations among texts, visual frames, and audio signals at the token level for auto-regressive sounding video generation. We also build AudioSetCap, a human-annotated text-video-audio paired dataset, for training SVG. Experimental results on the Kinetics and VAS datasets demonstrate the superiority of our method over existing text-to-video generation methods as well as audio generation methods.
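To make the inter-modal consistency idea concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of cross-modal attention between visual and audio token features followed by an InfoNCE-style contrastive loss over clip-level embeddings. The module names, feature dimensions, pooling strategy, and temperature are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: cross-modal attention + inter-modal contrastive loss.
# All hyper-parameters and pooling choices here are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalContrastive(nn.Module):
    def __init__(self, dim=256, heads=4, temperature=0.07):
        super().__init__()
        # Visual tokens attend to audio tokens and vice versa.
        self.v2a_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temperature = temperature

    def forward(self, vis_tokens, aud_tokens):
        # vis_tokens: (B, Nv, dim) visual token features from a VQ encoder
        # aud_tokens: (B, Na, dim) audio (mel-spectrogram) token features
        v_ctx, _ = self.v2a_attn(vis_tokens, aud_tokens, aud_tokens)
        a_ctx, _ = self.a2v_attn(aud_tokens, vis_tokens, vis_tokens)
        # Mean-pool to one embedding per clip and L2-normalize.
        v = F.normalize(v_ctx.mean(dim=1), dim=-1)
        a = F.normalize(a_ctx.mean(dim=1), dim=-1)
        # Inter-modal InfoNCE: matched (video, audio) pairs are positives,
        # all other pairs in the batch serve as negatives.
        logits = v @ a.t() / self.temperature          # (B, B)
        targets = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# Minimal usage with random features standing in for encoder outputs.
if __name__ == "__main__":
    loss_fn = CrossModalContrastive()
    vis = torch.randn(8, 64, 256)   # 8 clips, 64 visual tokens each
    aud = torch.randn(8, 32, 256)   # 8 clips, 32 audio tokens each
    print(loss_fn(vis, aud).item())
```

In this toy setup, pulling matched video-audio pairs together while pushing apart mismatched pairs is one common way to realize inter-modal consistency; the paper's hybrid scheme additionally models intra-modal consistency, which is not shown here.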