Current video generation models typically convert appearance and motion signals received from inputs (e.g., images, text) or latent spaces (e.g., noise vectors) into consecutive frames, following a stochastic generation process in which uncertainty is introduced by latent code sampling. However, this generation paradigm lacks deterministic constraints on both appearance and motion, leading to uncontrollable and undesirable outcomes. To this end, we propose a new task called Text-driven Video Prediction (TVP). Given the first frame and a text caption as inputs, the task is to synthesize the subsequent frames, with the image and the caption providing the appearance and motion components, respectively. The key to addressing TVP lies in fully exploiting the motion information latent in the text description, thereby enabling plausible video generation. This task is intrinsically a cause-and-effect problem, since the text content directly governs how the motion in the frames evolves. To exploit the capability of text for causal inference of progressive motion information, our TVP framework contains a Text Inference Module (TIM) that produces step-wise embeddings to regulate motion inference for subsequent frames. In addition, a refinement mechanism incorporating global motion semantics ensures coherent generation. Extensive experiments are conducted on the Something-Something V2 and Single Moving MNIST datasets. The results show that our model outperforms other baselines, verifying the effectiveness of the proposed framework.
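To make the notion of "step-wise embeddings" concrete, the following is a minimal, hypothetical sketch of a text inference module that unrolls a single caption embedding into one motion embedding per future frame. The class name `StepwiseTextInference`, the GRU-based recurrence, and all dimensions are illustrative assumptions for exposition only, not the paper's actual TIM implementation.

```python
# Hypothetical sketch: unroll a global caption embedding into T step-wise
# motion embeddings, one per frame to be predicted. Design choices here
# (GRUCell, dimensions) are assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class StepwiseTextInference(nn.Module):
    """Maps a caption embedding to a sequence of per-frame motion embeddings."""

    def __init__(self, text_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.init_state = nn.Linear(text_dim, hidden_dim)   # caption -> initial hidden state
        self.cell = nn.GRUCell(text_dim, hidden_dim)        # re-reads the caption at every step
        self.to_motion = nn.Linear(hidden_dim, hidden_dim)  # per-step motion embedding

    def forward(self, caption_emb: torch.Tensor, num_frames: int) -> torch.Tensor:
        # caption_emb: (B, text_dim) global embedding of the text caption
        h = torch.tanh(self.init_state(caption_emb))
        steps = []
        for _ in range(num_frames):
            h = self.cell(caption_emb, h)
            steps.append(self.to_motion(h))
        # (B, T, hidden_dim): one embedding per future frame
        return torch.stack(steps, dim=1)


if __name__ == "__main__":
    tim = StepwiseTextInference()
    caption = torch.randn(2, 512)          # stand-in for a sentence-encoder output
    motion = tim(caption, num_frames=10)   # step-wise embeddings for 10 future frames
    print(motion.shape)                    # torch.Size([2, 10, 256])
```

In such a design, each step's embedding would condition the generator for the corresponding frame, while the global caption embedding could additionally feed a refinement stage that enforces coherence across the whole sequence.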