Videos can be created by first outlining a global view of the scene and then adding local details. Inspired by this idea, we propose a cascaded model for video generation that follows a coarse-to-fine approach. First, our model generates a low-resolution video, establishing the global scene structure, which is then refined by subsequent cascade levels operating at larger resolutions. We train each cascade level sequentially on partial views of the videos, which reduces the computational complexity of our model and makes it scalable to high-resolution videos with many frames. We empirically validate our approach on UCF101 and Kinetics-600, on which our model is competitive with the state of the art. We further demonstrate the scaling capabilities of our model and train a three-level model on the BDD100K dataset, which generates 256×256-pixel videos with 48 frames.
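The sampling procedure implied by the abstract can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: `LevelGenerator`, `cascade_sample`, the per-level resolutions, and the trilinear upsampling between levels are all illustrative assumptions. It only shows the coarse-to-fine sampling path (a low-resolution video generated first, then upsampled and refined by each higher-resolution level); the sequential training on partial views described in the abstract is not sketched here.

```python
# A minimal sketch of cascaded coarse-to-fine video sampling.
# LevelGenerator and cascade_sample are hypothetical stand-ins; the real
# model's architecture, conditioning, and training are not specified here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LevelGenerator(nn.Module):
    """Hypothetical stand-in for one cascade level: refines a video
    at this level's spatial resolution."""
    def __init__(self, channels=3):
        super().__init__()
        self.refine = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, video):
        return self.refine(video)

def cascade_sample(levels, sizes, frames, batch=1, channels=3):
    """Generate a video coarse-to-fine.

    levels: one generator per cascade level.
    sizes:  spatial resolution per level, e.g. [64, 128, 256] (assumed).
    frames: number of frames at the final level.
    """
    # Level 0 establishes the global scene structure from noise.
    video = torch.randn(batch, channels, frames, sizes[0], sizes[0])
    video = levels[0](video)
    # Each subsequent level upsamples the previous output and refines it.
    for gen, size in zip(levels[1:], sizes[1:]):
        video = F.interpolate(video, size=(frames, size, size),
                              mode="trilinear", align_corners=False)
        video = gen(video)
    return video

# Usage: a three-level cascade producing 48-frame, 256x256 videos,
# mirroring the BDD100K configuration mentioned in the abstract.
levels = [LevelGenerator() for _ in range(3)]
video = cascade_sample(levels, sizes=[64, 128, 256], frames=48)
print(video.shape)  # torch.Size([1, 3, 48, 256, 256])
```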