As a natural extension of the image synthesis task, video synthesis has attracted considerable interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text provides explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. Building on the pre-trained VSG encoder, a VQ-VAE, and an auto-regressive Transformer, we propose a semantic scene graph-to-video synthesis framework (SSGVS) that synthesizes a video given an initial scene image and a variable number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the positive impact of video scene graphs on video synthesis. The source code will be released.
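To make the pipeline concrete, the sketch below shows one possible PyTorch rendering of the components named above: a VSG encoder that embeds sparse scene-graph annotations and predicts embeddings for unlabeled frames, and an auto-regressive Transformer over VQ-VAE frame tokens conditioned on those embeddings. All module names, dimensions, the triplet flattening, and the interpolation scheme are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# Minimal sketch of an SSGVS-style pipeline, assuming PyTorch.
# Names, dimensions, and the tokenization scheme are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VSGEncoder(nn.Module):
    """Encodes temporally discrete video scene graphs (here flattened into
    (subject, predicate, object) triplet ids) and predicts per-frame graph
    embeddings for unlabeled frames via temporal interpolation (assumed)."""

    def __init__(self, vocab_size=1000, dim=256, num_frames=16):
        super().__init__()
        self.triplet_emb = nn.Embedding(vocab_size, dim)
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.num_frames = num_frames

    def forward(self, triplets):                      # (B, T_annotated, 3)
        x = self.triplet_emb(triplets).mean(dim=2)    # (B, T_annotated, dim)
        h, _ = self.temporal(x)                       # (B, T_annotated, dim)
        # Fill in graph representations for frames without annotations.
        h = F.interpolate(h.transpose(1, 2), size=self.num_frames,
                          mode="linear", align_corners=False).transpose(1, 2)
        return h                                      # (B, num_frames, dim)


class SSGVS(nn.Module):
    """Auto-regressive Transformer over VQ-VAE frame-token indices,
    conditioned on the initial frame's tokens and the VSG embeddings."""

    def __init__(self, codebook_size=512, dim=256, tokens_per_frame=64):
        super().__init__()
        self.tok_emb = nn.Embedding(codebook_size, dim)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=4)
        self.head = nn.Linear(dim, codebook_size)
        self.tokens_per_frame = tokens_per_frame

    def forward(self, frame_tokens, graph_emb):
        # frame_tokens: (B, L) VQ-VAE indices of the frames generated so far
        # graph_emb:    (B, num_frames, dim) output of the VSG encoder
        x = self.tok_emb(frame_tokens)
        cond = graph_emb.repeat_interleave(self.tokens_per_frame, dim=1)
        x = x + cond[:, : x.size(1)]                  # inject temporal guidance
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.transformer(x, mask=mask.to(x.device)))


# Toy forward pass: one video, four annotated scene graphs, one initial frame.
vsg, model = VSGEncoder(), SSGVS()
graphs = torch.randint(0, 1000, (1, 4, 3))
init_tokens = torch.randint(0, 512, (1, 64))
logits = model(init_tokens, vsg(graphs))
print(logits.shape)   # torch.Size([1, 64, 512]) -> next-token logits
```

In this reading, frames are generated token by token from the logits and decoded back to pixels with the VQ-VAE decoder; the scene-graph embeddings supply the temporal guidance (e.g., when an action starts or ends) that class labels or text alone cannot.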