Videos depict complex dynamical systems evolving over time in the form of discrete image sequences. Generating controllable videos by learning the underlying dynamical system is an important yet underexplored topic in the computer vision community. This paper presents a novel framework, TiV-ODE, to generate highly controllable videos from a static image and a text caption. Specifically, our framework leverages the ability of Neural Ordinary Differential Equations~(Neural ODEs) to represent complex dynamical systems as a set of nonlinear ordinary differential equations. The resulting framework generates videos with both the desired dynamics and the desired content. Experiments demonstrate that the proposed method produces highly controllable and visually consistent videos and is capable of modeling dynamical systems. Overall, this work is a significant step towards developing advanced controllable video generation models that can handle complex and dynamic scenes.
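To make the Neural ODE idea concrete: the latent state of a video is assumed to evolve as dh/dt = f(h, t) for a learned nonlinear function f, and frames at arbitrary timestamps are obtained by numerically integrating f. The sketch below is purely illustrative and is not the paper's model; the tiny random-weight MLP and fixed-step Euler solver are hypothetical stand-ins for a trained dynamics network and an adaptive ODE solver.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8)) * 0.1   # hypothetical dynamics-MLP weights
W2 = rng.standard_normal((8, 16)) * 0.1

def f(h, t):
    """Nonlinear dynamics function: dh/dt = f(h, t)."""
    return np.tanh(h @ W1) @ W2

def odeint_euler(h0, ts, dt=0.01):
    """Fixed-step Euler integration; returns the latent state at each t in ts."""
    h, t, out = h0.copy(), 0.0, []
    for t_target in ts:
        while t < t_target:
            h = h + dt * f(h, t)
            t += dt
        out.append(h.copy())
    return np.stack(out)

h0 = rng.standard_normal(16)                  # latent code of the input image
traj = odeint_euler(h0, ts=[0.0, 0.5, 1.0])   # latent states at three timestamps
print(traj.shape)                             # (3, 16): one latent per frame
```

Because the trajectory is defined in continuous time, the same learned dynamics can be queried at any set of timestamps, which is what enables frame-rate-independent, controllable generation.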