We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence. A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency, such as a single latent code that dictates content for the entire video. At the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes. To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos. To this end, we leverage a two-phase training strategy, where we separately train using longer videos at a low resolution and shorter videos at a high resolution. To evaluate the capabilities of our model, we introduce two new benchmark datasets with explicit focus on long-term temporal dynamics.
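As a rough illustration of the two-phase training strategy mentioned in the abstract, the sketch below separates a low-resolution stage trained on long clips from a super-resolution stage trained on short high-resolution clips. All names here (LowResVideoGenerator, SuperResNetwork, the two dataloaders) and the simple reconstruction losses are hypothetical placeholders assumed for this sketch; they are not the paper's actual architecture, losses, or training code.

```python
# Minimal sketch, assuming hypothetical components: a low-resolution generator
# trained on long clips (Phase 1) and a separate super-resolution stage trained
# on short high-resolution clips (Phase 2). Placeholder networks and MSE losses
# stand in for the real model and adversarial objectives.
import torch
import torch.nn as nn


class LowResVideoGenerator(nn.Module):
    """Placeholder low-resolution video generator (long clips, e.g. 64x64)."""

    def __init__(self, latent_dim=64, channels=3):
        super().__init__()
        self.net = nn.Conv3d(latent_dim, channels, kernel_size=3, padding=1)

    def forward(self, temporal_latents):           # (B, latent_dim, T, H, W)
        return self.net(temporal_latents)          # (B, channels, T, H, W)


class SuperResNetwork(nn.Module):
    """Placeholder super-resolution stage applied to short low-res clips."""

    def __init__(self, channels=3, scale=4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=(1, scale, scale),
                              mode="trilinear", align_corners=False)
        self.refine = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, lowres_video):
        return self.refine(self.up(lowres_video))


def train_two_phase(long_lowres_loader, short_highres_loader, steps=1000):
    """Phase 1: long, low-res clips teach long-term dynamics.
       Phase 2: short, high-res clips teach fine spatial detail."""
    g_low, g_sr = LowResVideoGenerator(), SuperResNetwork()
    opt_low = torch.optim.Adam(g_low.parameters(), lr=2e-4)
    opt_sr = torch.optim.Adam(g_sr.parameters(), lr=2e-4)

    # Phase 1: fit the low-resolution generator on long clips.
    for _, real_lowres in zip(range(steps), long_lowres_loader):
        latents = torch.randn(real_lowres.shape[0], 64, *real_lowres.shape[2:])
        loss = nn.functional.mse_loss(g_low(latents), real_lowres)
        opt_low.zero_grad(); loss.backward(); opt_low.step()

    # Phase 2: fit the super-resolution network on short paired clips.
    for _, (lowres_clip, highres_clip) in zip(range(steps), short_highres_loader):
        loss = nn.functional.mse_loss(g_sr(lowres_clip), highres_clip)
        opt_sr.zero_grad(); loss.backward(); opt_sr.step()

    return g_low, g_sr
```

The point of the split is that long-term consistency only needs to be learned at coarse spatial scale, so long clips can fit in memory at low resolution, while fine detail is learned separately from short high-resolution clips.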