Learning to predict the long-term future of video frames is notoriously challenging due to the inherent ambiguity of the distant future and the dramatic amplification of prediction error over time. Despite recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), and extrapolating them to a longer future quickly leads to the degradation of structure and content. In this work, we revisit hierarchical models in video prediction. Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels via video-to-video translation. Despite its simplicity, we show that modeling structures and their dynamics in the discrete semantic structure space with a stochastic recurrent estimator leads to surprisingly successful long-term prediction. We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon (i.e., thousands of frames), setting a new standard for video prediction with orders-of-magnitude longer prediction times than existing approaches. Full videos and code are available at https://1konny.github.io/HVP/.
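To make the two-stage hierarchy concrete, below is a minimal sketch of the pipeline the abstract describes: an autoregressive stochastic recurrent estimator that rolls forward in a discrete semantic-structure space, followed by a per-frame structure-to-pixel translator. All module names, shapes, class counts, the argmax re-quantization step, and the placeholder translator are illustrative assumptions for this sketch, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureLSTM(nn.Module):
    """Hypothetical stochastic recurrent estimator over semantic structure maps.

    Each frame's structure is a categorical (segmentation-like) map; we embed
    it, inject a random latent for stochasticity, advance an LSTM state, and
    decode logits for the next frame's structure.
    """
    def __init__(self, num_classes=19, embed_dim=128, hidden_dim=256, z_dim=32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(num_classes, embed_dim, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(embed_dim * 64, hidden_dim),
        )
        self.rnn = nn.LSTMCell(hidden_dim + z_dim, hidden_dim)
        self.decode = nn.Sequential(
            nn.Linear(hidden_dim, num_classes * 16 * 16),
            nn.Unflatten(1, (num_classes, 16, 16)),
            nn.Upsample(scale_factor=4),  # coarse logits -> 64x64 structure map
        )
        self.z_dim = z_dim

    def forward(self, structure_onehot, state=None):
        h = self.encode(structure_onehot)
        z = torch.randn(h.size(0), self.z_dim, device=h.device)  # stochastic latent
        state = self.rnn(torch.cat([h, z], dim=1), state)
        logits = self.decode(state[0])
        return logits, state

def rollout(model, translator, first_structure, horizon):
    """Predict structures autoregressively, then render each one to pixels."""
    frames, struct, state = [], first_structure, None
    for _ in range(horizon):
        logits, state = model(struct, state)
        # Stay in the discrete structure space: re-quantize to a hard one-hot map
        # before feeding back, so errors cannot drift in continuous pixel space.
        idx = logits.argmax(dim=1)
        struct = F.one_hot(idx, logits.size(1)).permute(0, 3, 1, 2).float()
        frames.append(translator(struct))  # structure -> RGB frame
    return torch.stack(frames, dim=1)  # (B, T, 3, H, W)
```

A hypothetical usage, with 19 Cityscapes-like classes and a single convolution standing in for a learned video-to-video translator:

```python
model = StructureLSTM()
translator = nn.Conv2d(19, 3, 3, padding=1)  # stand-in for a vid2vid-style generator
first = F.one_hot(torch.randint(0, 19, (1, 64, 64)), 19).permute(0, 3, 1, 2).float()
video = rollout(model, translator, first, horizon=1000)  # (1, 1000, 3, 64, 64)
```

The key design choice this sketch illustrates is that long-horizon dynamics are modeled entirely in the low-dimensional discrete structure space, while pixel synthesis is a per-frame translation problem, which is what allows errors to stay bounded over thousands of frames.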