We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning. We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order. This classic work, however, was constrained by its use of hand-designed distance metrics, limiting its use to simple, repetitive videos. We draw on recent techniques from self-supervised learning to learn this distance metric, allowing us to compare frames in a manner that scales to more challenging dynamics, and to condition on other data, such as audio. We learn representations for video frames and frame-to-frame transition probabilities by fitting a video-specific model trained using contrastive learning. To synthesize a texture, we randomly sample frames with high transition probabilities to generate diverse temporally smooth videos with novel sequences and transitions. The model naturally extends to an audio-conditioned setting without requiring any finetuning. Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
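The core synthesis loop described above can be illustrated with a minimal sketch: given frame embeddings from a contrastive encoder, convert pairwise similarities into frame-to-frame transition probabilities and sample the next frame from them. All names (transition_matrix, synthesize, temperature) are illustrative assumptions, not the paper's implementation, and the sketch uses single-frame similarities rather than the full learned model.

```python
# Minimal sketch of the synthesis step, assuming frame embeddings have
# already been produced by a (video-specific) contrastive encoder.
import numpy as np

def transition_matrix(embeddings: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Turn pairwise embedding similarities into frame-to-frame transition
    probabilities via a temperature-scaled softmax over candidate next frames."""
    # Cosine similarity between every pair of frame embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = (normed @ normed.T) / temperature
    # Row-wise softmax: probability of transitioning from frame i to frame j.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def synthesize(probs: np.ndarray, start: int, length: int, rng=None) -> list:
    """Sample a novel frame ordering by repeatedly drawing the next frame
    from the transition distribution of the current frame."""
    rng = rng or np.random.default_rng()
    order = [start]
    for _ in range(length - 1):
        order.append(int(rng.choice(len(probs), p=probs[order[-1]])))
    return order
```

Because the output is just a re-ordering of existing frames, the sampled sequence stays photorealistic by construction; diversity comes from the stochastic choice among high-probability transitions.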