Videos can often be created by first outlining a global description of the scene and then adding local details. Inspired by this, we propose a hierarchical model for video generation that follows a coarse-to-fine approach. First, our model generates a low-resolution video that establishes the global scene structure; subsequent levels in the hierarchy then refine it. We train each level of the hierarchy sequentially on partial views of the videos, which reduces the computational complexity of our generative model and allows it to scale to high-resolution videos beyond a few frames. We validate our approach on Kinetics-600 and BDD100K, for which we train a three-level model capable of generating 256×256 videos with 48 frames.